Abstract: We study boosting algorithms from a new perspective. We show that the Lagrange dual problems of $\ell_1$ norm regularized 
AdaBoost, LogitBoost and soft-margin LPBoost with generalized hinge loss are all entropy maximization problems. By looking at 
the dual problems of these boosting algorithms, we show that the success of boosting algorithms can be understood in terms 
of maintaining a better margin distribution by maximizing margins and at the same time controlling the margin variance. We also 
theoretically prove that, approximately, $\ell_1$ norm regularized AdaBoost maximizes the average margin, instead of the minimum margin. 
The duality formulation also enables us to develop column generation based optimization algorithms, which are totally corrective. 
We show that they exhibit almost identical classification results to that of standard stage-wise additive boosting algorithms but with 
much faster convergence rates. Therefore fewer weak classifiers are needed to build the ensemble using our proposed optimization 
technique. 

Index Terms — AdaBoost, LogitBoost, LPBoost, Lagrange duality, linear programming, entropy maximization. 



1 Introduction 

BOOSTING has attracted a lot of research interest since 
the first practical boosting algorithm, AdaBoost, was 
introduced by Freund and Schapire. The machine learning 
community has spent much effort on understanding 
how the algorithm works 01/ 0L 0]- However, up to date 
there are still questions about the success of boosting that 
are left unanswered 0]. In boosting, one is given a set of 
training examples $\mathbf{x}_i \in \mathcal{X}$, $i = 1, \dots, M$, with binary labels 
$y_i$ being either $+1$ or $-1$. A boosting algorithm finds a 
convex linear combination of weak classifiers (a.k.a. base 
learners, weak hypotheses) that can achieve much better 
classification accuracy than an individual base classifier. To 
do so, there are two unknown variables to be optimized. 
The first one is the base classifiers. An oracle is needed 
to produce base classifiers. The second one is the positive 
weights associated with each base classifier. 

AdaBoost is one of the first and the most popular boost- 
ing algorithms for classification. Later, various boosting 
algorithms have been advocated. For example, LogitBoost 
by Friedman et al. 0] replaces AdaBoost's exponential cost 
function with the function of logistic regression. MadaBoost 
0] instead uses a modified exponential loss. The authors 
of 0] consider boosting algorithms with a generalized 
additive model framework. Schapire et al. [2] showed that 
AdaBoost converges to a large margin solution. However, 
it was recently pointed out that AdaBoost does not converge 
to the maximum margin solution 0], 0]. Motivated by 
the success of the margin theory associated with support 
vector machines (SVMs), LPBoost was invented by 0], lioll 
with the intuition of maximizing the minimum margin of 
all training examples. The final optimization problem can 
be formulated as a linear program (LP). It is observed 



that the hard-margin LPBoost does not perform well in 
most cases although it usually produces larger minimum 
margins. More often LPBoost has worse generalization 
performance. In other words, a higher minimum margin 
would not necessarily imply a lower test error. Breiman [11] 
also noticed the same phenomenon: his Arc-Gv algorithm 
has a minimum margin that provably converges to the 
optimal but Arc-Gv is inferior in terms of generalization ca- 
pability. Experiments on LPBoost and Arc-Gv have put the 
margin theory into serious doubt. More recently, Reyzin and 
Schapire [12] re-ran Breiman's experiments while controlling 
weak classifiers' complexity. They found that the minimum 
margin is indeed larger for Arc-Gv, but the overall margin 
distribution is typically better for AdaBoost. The conclusion 
is that the minimum margin is important, but not always 
at the expense of other factors. They also conjectured that 
maximizing the average margin, instead of the minimum 
margin, may result in better boosting algorithms. Recent 
theoretical work [13] has shown the important role of the 
margin distribution on bounding the generalization error 
of combined classifiers such as boosting and bagging. 

As the soft-margin SVM usually has a better classification 
accuracy than the hard-margin SVM, the soft-margin LP- 
Boost also performs better by relaxing the constraints that 
all training examples must be correctly classified. Cross- 
validation is required to determine an optimal value for the 
soft-margin trade-off parameter. Ratsch et al. Il3l showed 
the equivalence between SVMs and boosting-like algori- 
thms. Comprehensive overviews on boosting are given by 

We show in this work that the Lagrange duals of $\ell_1$ norm 
regularized AdaBoost, LogitBoost and LPBoost with generalized 
hinge loss are all entropy maximization problems. 
Previous work like jl^l , flill , Il9ll noticed the connection 
between boosting techniques and entropy maximization 
based on Bregman distances. They did not show that the 
duals of boosting algorithms are actually entropy regularized 
LPBoost, as we show in (10), (28) and (31). By knowing 
this duality equivalence, we derive a general column 
generation (CG) based optimization framework that can be 
used to optimize arbitrary convex loss functions. In other 
words, we can easily design totally-corrective AdaBoost, 
LogitBoost and boosting with generalized hinge loss, etc. 
Our major contributions are the following: 

1) We derive the Lagrangian duals of boosting algo- 
rithms and show that most of them are entropy 
maximization problems. 

2) The authors of [12] conjectured that "it may be fruitful 
to consider boosting algorithms that greedily maxi- 
mize the average or median margin rather than the 
minimum one". We theoretically prove that, actually, 
$\ell_1$ norm regularized AdaBoost approximately maximizes 
the average margin, instead of the minimum 
margin. This is an important result in the sense that 
it provides an alternative theoretical explanation that 
is consistent with the margins theory and agrees with 
the empirical observations made by fl3l . 

3) We propose AdaBoost-QP that directly optimizes the 
asymptotic cost function of AdaBoost. The experi- 
ments confirm our theoretical analysis. 

4) Furthermore, based on the duals we derive, we design 
column generation based optimization techniques for 
boosting learning. We show that the new algorithms 
have almost identical results to that of standard stage- 
wise additive boosting algorithms but with much 
faster convergence rates. Therefore fewer weak clas- 
sifiers are needed to build the ensemble. 

The following notation is used. Typically, we use bold 
letters $\mathbf{u}, \mathbf{v}$ to denote vectors, as opposed to scalars $u, v$ 
in lower case letters. We use capital letters $U, V$ to denote 
matrices. All vectors are column vectors unless otherwise 
specified. The inner product of two column vectors $\mathbf{u}$ and 
$\mathbf{v}$ is $\mathbf{u}^\top \mathbf{v} = \sum_i u_i v_i$. Component-wise inequalities are 
expressed using the symbols $\succeq, \succ, \preceq, \prec$; e.g., $\mathbf{u} \succeq \mathbf{v}$ means 
$u_i \ge v_i$ for all entries. $\mathbf{0}$ and $\mathbf{1}$ are column vectors with 
each entry being $0$ and $1$ respectively. The length will be 
clear from the context. The abbreviation s.t. means "subject 
to". We denote the domain of a function $f(\cdot)$ as $\operatorname{dom} f$. 

The paper is organized as follows. Section 2 briefly 
reviews several boosting algorithms for completeness. 
Their corresponding duals are derived in Section 3. Our 
main results are also presented in Section 3. In Section 4, 
we then present numerical experiments to illustrate various 
aspects of our new algorithms obtained in Section 3. We 
conclude the paper in the last section. 

2 Boosting Algorithms 

We first review some basic ideas and the corresponding op- 
timization problems of AdaBoost, LPBoost and LogitBoost, 
which are of interest in this present work. 

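Let $\mathcal{H}$ be a class of base classifiers $\mathcal{H} = \{h_j(\cdot) : \mathcal{X} \to \mathbb{R},\ j = 1, \dots, N\}$. 
A boosting algorithm seeks a convex 
linear combination 

$$F(\mathbf{x}) = \sum_{j=1}^{N} w_j h_j(\mathbf{x}), \qquad (1)$$

where $\mathbf{w}$ is the vector of weak classifier weights to be optimized. 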
AdaBoost calls an oracle that selects a weak classifier $h_j(\cdot)$ 
at each iteration $j$ and then calculates the weight $w_j$ associated 
with $h_j(\cdot)$. It is shown in @], j2cj that AdaBoost (and 
many others like LogitBoost) performs coordinate gradient 
descent in function space, at each iteration choosing a 
weak classifier to include in the combination such that the 
cost function is maximally reduced. It is well known that 
coordinate descent has a slow convergence in many cases. 
From an optimization point of view, there is no particular 
reason to keep the weights $w_1, \dots, w_{j-1}$ fixed at iteration 
$j$. Here we focus on the underlying mathematical programs 
that boosting algorithms minimize. 

AdaBoost has been proved to minimize the exponential loss 
function [17]: 

$$\min_{\mathbf{w}} \ \sum_{i=1}^{M} \exp\left(-y_i F(\mathbf{x}_i)\right), \quad \text{s.t. } \mathbf{w} \succeq \mathbf{0}. \qquad (2)$$



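Because the logarithmic function $\log(\cdot)$ is strictly monotonically 
increasing, AdaBoost equivalently solves 

$$\min_{\mathbf{w}} \ \log\left(\sum_{i=1}^{M} \exp\left(-y_i F(\mathbf{x}_i)\right)\right), \quad \text{s.t. } \mathbf{w} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{w} = \frac{1}{T}. \qquad (3)$$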

Note that in the AdaBoost algorithm, the constraint $\mathbf{1}^\top \mathbf{w} = \frac{1}{T}$ 
is not explicitly enforced. However, without this regularization 
constraint, in the case of separable training data, 
one can always make the cost function approach zero by 
enlarging the solution $\mathbf{w}$ by an arbitrarily large factor. Here 
what matters is the sign of the classification evaluation 
function. Standard AdaBoost seems to select the value of 
$T$ by selecting how many iterations it runs. Note that 
the relaxed version $\mathbf{1}^\top \mathbf{w} \le \frac{1}{T}$ is actually equivalent to 
$\mathbf{1}^\top \mathbf{w} = \frac{1}{T}$. With the constraint $\mathbf{1}^\top \mathbf{w} \le \frac{1}{T}$, if the final 
solution has $\mathbf{1}^\top \mathbf{w} < \frac{1}{T}$, one can scale $\mathbf{w}$ such that $\mathbf{1}^\top \mathbf{w} = \frac{1}{T}$, 
and clearly the scaled $\mathbf{w}$ achieves a smaller loss. So the 
optimum must be achieved at the boundary. 

The boosting algorithm introduced in (3) is an $\ell_1$ norm 
regularized version of the original AdaBoost because it is 
equivalent to 

$$\min_{\mathbf{w}} \ \log\left(\sum_{i=1}^{M} \exp\left(-y_i F(\mathbf{x}_i)\right)\right) + \frac{1}{T'} \mathbf{1}^\top \mathbf{w}, \quad \text{s.t. } \mathbf{w} \succeq \mathbf{0}. \qquad (4)$$

For a certain $T$, one can always find a $T'$ such that (3) and 
(4) have exactly the same solution. Hereafter, we refer to 
this algorithm as AdaBoost$_{\ell_1}$. 

We will show that it is very important to introduce this 
new cost function. All of our main results on AdaBoost$_{\ell_1}$ 
are obtained by analyzing this logarithmic cost function, 
not the original cost function. Let us define the matrix 
$H \in \mathbb{Z}^{M \times N}$, which contains all the possible predictions 
of the training data using weak classifiers from the pool 
$\mathcal{H}$. Explicitly, $H_{ij} = h_j(\mathbf{x}_i)$ is the label ($\{+1, -1\}$) given 
by weak classifier $h_j(\cdot)$ on the training example $\mathbf{x}_i$. We 
use $H_i = [H_{i1}\ H_{i2}\ \cdots\ H_{iN}]$ to denote the $i$-th row of $H$, 
which constitutes the output of all the weak classifiers on 
the training example $\mathbf{x}_i$. The cost function of AdaBoost$_{\ell_1}$ 
writes: 

$$\min_{\mathbf{w}} \ \log\left(\sum_{i=1}^{M} \exp\left(-y_i H_i \mathbf{w}\right)\right), \quad \text{s.t. } \mathbf{w} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{w} = \frac{1}{T}. \qquad (5)$$

1. The reason why we do not write this constraint as $\mathbf{1}^\top \mathbf{w} = T$ will 
become clear later. 



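We can also write the above program into 

$$\min_{\mathbf{w}} \ \log\left(\sum_{i=1}^{M} \exp\left(-\frac{y_i H_i \mathbf{w}}{T}\right)\right), \quad \text{s.t. } \mathbf{w} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{w} = 1, \qquad (6)$$

which is exactly the same as (5). In |J] the smooth margin, 
which is similar to but different from the logarithmic cost function, 
is used to analyze AdaBoost's convergence behavior.$^2$ 

Algorithm 1 Stage-wise AdaBoost, and Arc-Gv. 

Input: Training set $(\mathbf{x}_i, y_i)$, $y_i \in \{+1, -1\}$, $i = 1, \dots, M$; 
maximum iteration $N_{\max}$. 
1 Initialization: $u_i^1 = \frac{1}{M}$, $\forall i = 1, \dots, M$. 
2 for $j = 1, \dots, N_{\max}$ do 

1) Find a new base $h_j(\cdot)$ using the distribution $\mathbf{u}^j$; 

2) Choose $w_j$; 

3) Update $\mathbf{u}$: $u_i^{j+1} \propto u_i^j \exp\left(-y_i w_j h_j(\mathbf{x}_i)\right)$, $\forall i$; and 
normalize $\mathbf{u}^{j+1}$. 

Output: The learned classifier $F(\mathbf{x}) = \sum_{j=1}^{N} w_j h_j(\mathbf{x})$. 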

Problem (5) (or (6)) is a convex problem in $\mathbf{w}$. We know 
that the log-sum-exp function $\mathrm{lse}(\mathbf{x}) = \log\left(\sum_{i=1}^{M} \exp x_i\right)$ 
is convex [21]. Composition with an affine mapping preserves 
convexity. Therefore, the cost function is convex. The 
constraints are linear hence convex too. For completeness, 
we include the description of the standard stage-wise AdaBoost 
and Arc-Gv in Algorithm 1. The only difference between 
these two algorithms is the way to calculate $w_j$ (step (2) of 
Algorithm 1). For AdaBoost: 



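$$w_j = \frac{1}{2} \log \frac{1 + r_j}{1 - r_j}, \qquad (7)$$

where $r_j$ is the edge of the weak classifier $h_j(\cdot)$, defined 
as $r_j = \sum_{i=1}^{M} u_i y_i h_j(\mathbf{x}_i) = \sum_{i=1}^{M} u_i y_i H_{ij}$. Arc-Gv modifies 
(7) in order to maximize the minimum margin: 

$$w_j = \frac{1}{2} \log \frac{1 + r_j}{1 - r_j} - \frac{1}{2} \log \frac{1 + \varrho_j}{1 - \varrho_j}, \qquad (8)$$

where $\varrho_j$ is the minimum margin over all training examples 
of the combined classifier up to the current round: 
$\varrho_j = \min_i \left\{ y_i \sum_{s=1}^{j-1} w_s h_s(\mathbf{x}_i) \big/ \sum_{s=1}^{j-1} w_s \right\}$, with $\varrho_1 = 0$. 
Arc-Gv clips $w_j$ into $[0, 1]$ by setting $w_j = 1$ if $w_j > 1$ and 
$w_j = 0$ if $w_j < 0$ [11]. Other work such as 2J has 
used different approaches to determine $\varrho_j$ in (8). 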
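The two step-size rules and the reweighting step of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the four-example toy data are hypothetical, and the weak classifier is fixed rather than produced by an oracle.

```python
import math

def adaboost_weight(edge):
    """Step size of Eq. (7); `edge` must lie strictly inside (-1, 1)."""
    return 0.5 * math.log((1.0 + edge) / (1.0 - edge))

def arcgv_weight(edge, min_margin):
    """Step size of Eq. (8), clipped into [0, 1] as described in the text."""
    w = adaboost_weight(edge) - adaboost_weight(min_margin)
    return min(1.0, max(0.0, w))

def edge_of(u, y, h_col):
    """Edge r_j = sum_i u_i * y_i * H_ij for one weak classifier."""
    return sum(ui * yi * hi for ui, yi, hi in zip(u, y, h_col))

def reweight(u, y, h_col, w):
    """Step 3 of Algorithm 1: multiplicative update, then normalization."""
    raw = [ui * math.exp(-yi * w * hi) for ui, yi, hi in zip(u, y, h_col)]
    total = sum(raw)
    return [v / total for v in raw]

# Hypothetical toy data: the weak classifier errs on the last example only.
y = [1, 1, -1, -1]
h = [1, 1, -1, 1]
u = [0.25] * 4                 # uniform initialization

r = edge_of(u, y, h)           # edge under the uniform distribution
w = adaboost_weight(r)         # AdaBoost step size for this edge
u_next = reweight(u, y, h, w)  # misclassified example gains weight
```

With the AdaBoost step size, the updated distribution gives the same weak classifier an edge of zero, which is why the oracle has to return a different classifier in the next round.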



3 Lagrange Dual of Boosting Algorithms 

Our main derivations are based on a form of duality termed 
convex conjugate or Fenchel duality. 

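Definition 3.1. (Convex conjugate) Let $f : \mathbb{R}^n \to \mathbb{R}$. The 
function $f^* : \mathbb{R}^n \to \mathbb{R}$, defined as 

$$f^*(\mathbf{u}) = \sup_{\mathbf{x} \in \operatorname{dom} f} \left( \mathbf{u}^\top \mathbf{x} - f(\mathbf{x}) \right), \qquad (9)$$

is called the convex conjugate (or Fenchel conjugate) of the 
function $f(\cdot)$. The domain of the conjugate function consists 
of $\mathbf{u} \in \mathbb{R}^n$ for which the supremum is finite. 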

2. The smooth margin in @| is defined as 

$$\frac{-\log\left(\sum_{i=1}^{M} \exp\left(-y_i H_i \mathbf{w}\right)\right)}{\mathbf{1}^\top \mathbf{w}}.$$



$f^*(\cdot)$ is always a convex function because it is the pointwise 
supremum of a family of affine functions of $\mathbf{u}$. This is true 
even if $f(\cdot)$ is non-convex [21]. 

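Proposition 3.1. (Conjugate of log-sum-exp) The conjugate 
of the log-sum-exp function is the negative entropy function, 
restricted to the probability simplex. Formally, for $\mathrm{lse}(\mathbf{x}) = \log\left(\sum_{i=1}^{M} \exp x_i\right)$, 
its conjugate is: 

$$\mathrm{lse}^*(\mathbf{u}) = \begin{cases} \sum_{i=1}^{M} u_i \log u_i, & \text{if } \mathbf{u} \succeq \mathbf{0} \text{ and } \mathbf{1}^\top \mathbf{u} = 1; \\ \infty, & \text{otherwise.} \end{cases}$$

We interpret $0 \log 0$ as $0$. 

Chapter 3.3 of [21] gives this result. 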
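The proposition can be checked numerically for a point in the interior of the simplex. The sketch below assumes a hypothetical distribution `u`; it evaluates the conjugate objective $\mathbf{u}^\top \mathbf{z} - \mathrm{lse}(\mathbf{z})$ at $\mathbf{z} = \log \mathbf{u}$, where the supremum is attained, and compares it with random trial points.

```python
import math
import random

def lse(z):
    """Numerically stable log-sum-exp."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def neg_entropy(u):
    """sum_i u_i log u_i with the convention 0 log 0 = 0."""
    return sum(ui * math.log(ui) for ui in u if ui > 0.0)

def conjugate_objective(u, z):
    """The quantity inside the supremum of Definition 3.1 for f = lse."""
    return sum(ui * zi for ui, zi in zip(u, z)) - lse(z)

u = [0.1, 0.2, 0.3, 0.4]                  # hypothetical point on the simplex
z_star = [math.log(ui) for ui in u]       # maximizer of the conjugate objective
best = conjugate_objective(u, z_star)

rng = random.Random(0)
trials = [
    conjugate_objective(u, [rng.uniform(-5.0, 5.0) for _ in u])
    for _ in range(1000)
]
```

No random trial exceeds the value at `z_star`, and that value coincides with the negative entropy of `u`.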

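Theorem 3.1. The dual of AdaBoost$_{\ell_1}$ is a Shannon entropy 
maximization problem, which writes, 

$$\max_{r, \mathbf{u}} \ \frac{r}{T} - \sum_{i=1}^{M} u_i \log u_i \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{ij} \le -r \ (\forall j = 1, \dots, N),\ \mathbf{u} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{u} = 1. \qquad (10)$$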



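Proof: To derive a Lagrange dual of AdaBoost$_{\ell_1}$, we 
first introduce a new variable $\mathbf{z} \in \mathbb{R}^M$ such that its $i$-th 
entry $z_i = -y_i H_i \mathbf{w}$, to obtain the equivalent problem 

$$\min_{\mathbf{w}, \mathbf{z}} \ \log\left(\sum_{i=1}^{M} \exp z_i\right) \quad \text{s.t. } z_i = -y_i H_i \mathbf{w} \ (\forall i = 1, \dots, M),\ \mathbf{w} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{w} = \frac{1}{T}. \qquad (11)$$

The Lagrangian $L(\cdot)$ associated with the problem (11) is 

$$L(\mathbf{w}, \mathbf{z}, \mathbf{u}, \mathbf{q}, r) = \log\left(\sum_{i=1}^{M} \exp z_i\right) - \sum_{i=1}^{M} u_i \left(z_i + y_i H_i \mathbf{w}\right) - \mathbf{q}^\top \mathbf{w} - r\left(\mathbf{1}^\top \mathbf{w} - \frac{1}{T}\right), \qquad (12)$$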



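with $\mathbf{q} \succeq \mathbf{0}$. The dual function is 

$$\begin{aligned}
\inf_{\mathbf{z}, \mathbf{w}} L &= \inf_{\mathbf{z}, \mathbf{w}} \ \log\left(\sum_{i=1}^{M} \exp z_i\right) - \sum_{i=1}^{M} u_i z_i + \frac{r}{T} - \underbrace{\left(\sum_{i=1}^{M} u_i y_i H_i + \mathbf{q}^\top + r \mathbf{1}^\top\right)}_{\text{must be } \mathbf{0}} \mathbf{w} \\
&= \inf_{\mathbf{z}} \ \log\left(\sum_{i=1}^{M} \exp z_i\right) - \mathbf{u}^\top \mathbf{z} + \frac{r}{T} \\
&= -\underbrace{\sup_{\mathbf{z}} \left( \mathbf{u}^\top \mathbf{z} - \log\left(\sum_{i=1}^{M} \exp z_i\right) \right)}_{\mathrm{lse}^*(\mathbf{u}) \text{ (see Proposition 3.1)}} + \frac{r}{T} \\
&= -\sum_{i=1}^{M} u_i \log u_i + \frac{r}{T}. \qquad (13)
\end{aligned}$$

By collecting all the constraints and eliminating $\mathbf{q}$, the dual 
of Problem (5) is (10). □ 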
Keeping two variables $\mathbf{w}$ and $\mathbf{z}$, and introducing new 
equality constraints $z_i = -y_i H_i \mathbf{w}$, $\forall i$, is essential to derive 
the above simple and elegant Lagrange dual. Simple 
equivalent reformulations of a problem can lead to very 
different dual problems. Without introducing new variables 
and equality constraints, one would not be able to obtain 
(10). Here we have considered the negative margins $z_i$ to 
be the central objects of study. In [23], a similar idea has 
been used to derive different duals of kernel methods, 
which leads to the so-called value regularization. We focus 
on boosting algorithms instead of kernel methods in this 
work. Also note that we would have the following dual if 
we work directly on the cost function in (3): 

$$\max_{r, \mathbf{u}} \ \frac{r}{T} - \sum_{i=1}^{M} u_i \log u_i + \mathbf{1}^\top \mathbf{u} \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{ij} \le -r \ (\forall j = 1, \dots, N). \qquad (14)$$

No normalization requirement $\mathbf{1}^\top \mathbf{u} = 1$ is imposed. Instead, 
$\mathbf{1}^\top \mathbf{u}$ works as a regularization term. The connection 
between AdaBoost and LPBoost is not clear with this dual. 

Lagrange duality between problems (5) and (10) assures 
that weak duality and strong duality hold. Weak duality 
says that any feasible solution of (10) produces a lower 
bound of the original problem (5). Strong duality tells us 
the optimal value of (10) is the same as the optimal value of 
(5). The weak duality is guaranteed by the Lagrange duality 
theory. The strong duality holds since the primal problem 
(5) is a convex problem that satisfies Slater's condition [21]. 
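Strong duality can be observed on a tiny instance. The following is a sketch under stated assumptions: the data `y`, `H` and the value of `T` are hypothetical, there are only two weak classifiers so that the primal reduces to a one-dimensional convex problem solvable by ternary search, and the dual point is built from the primal solution as the softmax of the negative margins rather than by a dual solver.

```python
import math

def lse(z):
    """Numerically stable log-sum-exp."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

# Hypothetical toy instance: M = 4 examples, N = 2 weak classifiers.
y = [1, 1, -1, -1]
H = [[1, 1], [1, -1], [-1, 1], [1, -1]]
T = 0.5                                   # the constraint is 1^T w = 1/T = 2

def neg_margins(w):
    """z_i = -y_i * H_i w for every training example."""
    return [-yi * sum(h * wj for h, wj in zip(row, w)) for yi, row in zip(y, H)]

def primal(w1):
    """Log-sum-exp cost with w = (w1, 1/T - w1), which keeps 1^T w = 1/T."""
    return lse(neg_margins([w1, 1.0 / T - w1]))

# Ternary search over w1 in [0, 1/T]; valid because the cost is convex in w1.
lo, hi = 0.0, 1.0 / T
for _ in range(200):
    a = lo + (hi - lo) / 3.0
    b = hi - (hi - lo) / 3.0
    if primal(a) < primal(b):
        hi = b
    else:
        lo = a
w1 = 0.5 * (lo + hi)
w_star = [w1, 1.0 / T - w1]
z_star = neg_margins(w_star)
primal_opt = lse(z_star)

# Dual candidate: u is the softmax of z, r is the largest feasible value.
u_star = [math.exp(zi - primal_opt) for zi in z_star]
edges = [sum(ui * yi * row[j] for ui, yi, row in zip(u_star, y, H)) for j in range(2)]
r_star = -max(edges)
dual_val = r_star / T - sum(ui * math.log(ui) for ui in u_star)
```

Up to the tolerance of the search, `dual_val` equals `primal_opt`. Any other point on the simplex, for example the uniform distribution with its own best `r`, yields a dual value that is no larger.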

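To show the connection with LPBoost, we equivalently 
rewrite the above formulation by reversing the sign of $r$ 
and multiplying the cost function with $T$, ($T > 0$): 

$$\min_{r, \mathbf{u}} \ r + T \sum_{i=1}^{M} u_i \log u_i \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{ij} \le r \ (\forall j = 1, \dots, N),\ \mathbf{u} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{u} = 1. \qquad (15)$$

Note that the constraint $\mathbf{u} \succeq \mathbf{0}$ is implicitly enforced by 
the logarithmic function and thus it can be dropped when 
one solves (15). 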

3.1 Connection between AdaBoost$_{\ell_1}$ and Gibbs free energy 

Gibbs free energy is the chemical potential that is min- 
imized when a system reaches equilibrium at constant 
pressure and temperature. 



Let us consider a system that has $M$ states at temperature 
$T$. Each state has energy $v_i$ and probability $u_i$ of 
occurring. The Gibbs free energy of this system is related 
to its average energy and entropy, namely: 

$$G(\mathbf{v}, \mathbf{u}) = \mathbf{u}^\top \mathbf{v} + T \sum_{i=1}^{M} u_i \log u_i. \qquad (16)$$

When the system reaches equilibrium, $G(\mathbf{v}, \mathbf{u})$ is minimized. 
So we have 

$$\min_{\mathbf{u}} \ G(\mathbf{v}, \mathbf{u}), \quad \text{s.t. } \mathbf{u} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{u} = 1. \qquad (17)$$

The constraints ensure that $\mathbf{u}$ is a probability distribution. 
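For a single fixed energy vector, the minimization over the simplex has a well-known closed form: the minimizer is the Boltzmann (Gibbs) distribution $u_i \propto \exp(-v_i / T)$ and the minimum value is $-T \log \sum_i \exp(-v_i / T)$. The sketch below checks this on hypothetical energies; the function names are illustrative only.

```python
import math
import random

def gibbs_free_energy(v, u, T):
    """G(v, u) = u^T v + T * sum_i u_i log u_i, with 0 log 0 = 0."""
    avg_energy = sum(ui * vi for ui, vi in zip(u, v))
    neg_entropy = sum(ui * math.log(ui) for ui in u if ui > 0.0)
    return avg_energy + T * neg_entropy

def gibbs_distribution(v, T):
    """Boltzmann distribution u_i proportional to exp(-v_i / T)."""
    m = min(v)                                   # shift for numerical stability
    weights = [math.exp(-(vi - m) / T) for vi in v]
    total = sum(weights)
    return [wi / total for wi in weights]

def random_simplex_point(n, rng):
    raw = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(raw)
    return [x / s for x in raw]

v = [1.0, -1.0, 1.0, -1.0, -1.0]                 # hypothetical state energies
T = 0.7
u_eq = gibbs_distribution(v, T)
g_min = gibbs_free_energy(v, u_eq, T)
closed_form = -T * math.log(sum(math.exp(-vi / T) for vi in v))

rng = random.Random(1)
others = [gibbs_free_energy(v, random_simplex_point(len(v), rng), T) for _ in range(1000)]
```

None of the randomly drawn distributions attains a lower free energy than `u_eq`.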

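Now let us define vector $\mathbf{v}_j$ with its entries being $V_{ij} = y_i H_{ij}$. 
$V_{ij}$ is the energy associated with state $i$ for case $j$. $V_{ij}$ 
can only take discrete binary values $+1$ or $-1$. We rewrite 
our dual optimization problem (15) into 

$$\min_{\mathbf{u}} \ \underbrace{\max_{j} \left\{ \mathbf{u}^\top \mathbf{v}_j \right\}}_{\text{worst case energy vector } \mathbf{v}_j} + T \sum_{i=1}^{M} u_i \log u_i, \quad \text{s.t. } \mathbf{u} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{u} = 1. \qquad (18)$$

This can be interpreted as finding the minimum Gibbs free 
energy for the worst case energy vector. 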

3.2 Connection between AdaBoost$_{\ell_1}$ and LPBoost 

First let us recall the basic concepts of LPBoost. The idea 
of LPBoost is to maximize the minimum margin because 
it is believed that the minimum margin plays a critically 
important role in terms of generalization error Q]. The 
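hard-margin LPBoost 0] can be formulated as 

$$\max_{\mathbf{w}} \ \underbrace{\min_{i} \left\{ y_i H_i \mathbf{w} \right\}}_{\text{minimum margin}}, \quad \text{s.t. } \mathbf{w} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{w} = 1. \qquad (19)$$

This problem can be solved as an LP. Its dual is also an LP: 

$$\min_{r, \mathbf{u}} \ r \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{ij} \le r \ (\forall j = 1, \dots, N),\ \mathbf{u} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{u} = 1. \qquad (20)$$

Arc-Gv has been shown to converge asymptotically to a solution of the 
above LPs [11]. 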
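Weak duality between the two LPs says that the minimum margin of any feasible $\mathbf{w}$ never exceeds the maximum edge of any feasible $\mathbf{u}$, because $\min_i y_i H_i \mathbf{w} \le \mathbf{u}^\top \operatorname{diag}(\mathbf{y}) H \mathbf{w} \le \max_j \sum_i u_i y_i H_{ij}$. The sketch below samples random feasible pairs on a hypothetical instance instead of calling an LP solver.

```python
import random

# Hypothetical instance: M = 4 examples, N = 3 weak classifiers.
y = [1, 1, -1, -1]
H = [[1, 1, -1], [1, -1, 1], [-1, 1, 1], [1, -1, -1]]
M, N = len(H), len(H[0])

def min_margin(w):
    """Objective of the primal: the smallest margin y_i * H_i w."""
    return min(yi * sum(h * wj for h, wj in zip(row, w)) for yi, row in zip(y, H))

def max_edge(u):
    """Smallest feasible r in the dual for a given u: the largest edge."""
    return max(sum(ui * yi * row[j] for ui, yi, row in zip(u, y, H)) for j in range(N))

def random_simplex_point(n, rng):
    raw = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(raw)
    return [x / s for x in raw]

rng = random.Random(2)
gaps = []
for _ in range(1000):
    w = random_simplex_point(N, rng)
    u = random_simplex_point(M, rng)
    gaps.append(max_edge(u) - min_margin(w))   # nonnegative by weak duality
```

At a primal-dual optimal pair the gap closes to zero; random pairs only show that it is never negative.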

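The performance deteriorates when no linear combination 
of weak classifiers can be found that separates the 
training examples. By introducing slack variables, we get 
the soft-margin LPBoost algorithm 

$$\max_{\mathbf{w}, \varrho, \boldsymbol{\xi}} \ \varrho - D \mathbf{1}^\top \boldsymbol{\xi} \quad \text{s.t. } y_i H_i \mathbf{w} \ge \varrho - \xi_i \ (\forall i = 1, \dots, M),\ \mathbf{w} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{w} = 1,\ \boldsymbol{\xi} \succeq \mathbf{0}. \qquad (21)$$

Here $D$ is a trade-off parameter that controls the balance 
between training error and margin maximization. The dual 
of (21) is similar to the hard-margin case except that the 
dual variable $\mathbf{u}$ is capped: 

$$\min_{r, \mathbf{u}} \ r \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{ij} \le r \ (\forall j = 1, \dots, N),\ D\mathbf{1} \succeq \mathbf{u} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{u} = 1. \qquad (22)$$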
Comparing (15) with hard-margin LPBoost's dual, it is 
easy to see that the only difference is the entropy term in 
the cost function. If we set $T = 0$, (15) reduces to the hard-margin 
LPBoost. In this sense, we can view AdaBoost$_{\ell_1}$'s 
dual as entropy regularized hard-margin LPBoost. Since the 
regularization coefficient $T$ is always positive, the effect 
of the entropy regularization term is to encourage the 
distribution $\mathbf{u}$ to be as uniform as possible (the negative entropy 
$\sum_{i=1}^{M} u_i \log u_i$ is, up to the additive constant $\log M$, the Kullback-Leibler 
divergence between $\mathbf{u}$ and the uniform distribution). This may explain the 
underlying reason for AdaBoost's success over hard-margin 
LPBoost: limiting the weight distribution $\mathbf{u}$ leads to better 
generalization performance. But why, and how? We will 
discover the mechanism in Section 3.3. 

When the regularization coefficient, $T$, is sufficiently 
large, the entropy term in the cost function dominates. 
In this case, all discrete probabilities $u_i$ become almost the 
same and therefore gather around the center of the simplex 
$\{\mathbf{u} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{u} = 1\}$. As $T$ decreases, the solution will 
gradually shift to the boundaries of the simplex to find 
the mixture that best approximates the maximum. 
Therefore, $T$ can be also viewed as a homotopy parameter 
that bridges a maximum entropy problem with uniform 
distribution $u_i = 1/M$ ($i = 1, \dots, M$), to a solution of the 
max-min problem (19). 

This observation is also consistent with the soft-margin 
LPBoost. We know that soft-margin LPBoost often outperforms 
hard-margin LPBoost ©], E3. In the primal, it is 
usually explained that the hinge loss of soft-margin is more 
appropriate for classification. The introduction of slack 
variables in the primal actually results in box constraints 
on the weight distribution in the dual. In other words the 
$\ell_\infty$ norm of $\mathbf{u}$, $\|\mathbf{u}\|_\infty$, is capped. This capping mechanism 
is harder than the entropy regularization mechanism of 
AdaBoost$_{\ell_1}$. Nevertheless, both are beneficial on inseparable 
data. In jjjll, it is proved that soft-margin LPBoost 
actually maximizes the average of $1/D$ smallest margins. 

Now let us take a look at the cost function of AdaBoost 
and LPBoost in the primal. The log-sum-exp cost employed 
by AdaBoost can be viewed as a smooth approximation of 
the maximum function because of the following inequality: 

$$\max_i a_i \le \log\left(\sum_{i=1}^{M} \exp a_i\right) \le \max_i a_i + \log M.$$

Therefore, LPBoost uses a hard maximum (or minimum) 
function while AdaBoost uses a soft approximation of the 
maximum (minimum) function. We try to explain why 
AdaBoost's soft cost function is better than LPBoost's$^3$ hard 
cost function next. 
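The sandwich inequality is easy to verify numerically. The sketch below uses hypothetical values `a`; the helper `soft_max` with a scale parameter `T` is an illustrative addition (not a function from the paper) that applies the same bound to `a / T`, which tightens the gap to at most `T * log M`.

```python
import math

def lse(a):
    """Numerically stable log-sum-exp."""
    m = max(a)
    return m + math.log(sum(math.exp(x - m) for x in a))

def soft_max(a, T):
    """T * lse(a / T): lies between max(a) and max(a) + T * log(len(a))."""
    return T * lse([x / T for x in a])

a = [-3.0, 0.5, 2.0, 1.9]                 # hypothetical values
lower = max(a)
value = lse(a)
upper = max(a) + math.log(len(a))

tight = soft_max(a, 1e-3)                 # close to the hard maximum
large_inputs = lse([1000.0, 1000.0])      # no overflow thanks to the shift
```

Shrinking `T` moves the soft maximum toward the hard one, which mirrors the way a small regularization coefficient moves the entropy regularized dual toward hard-margin LPBoost.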



3.3 AdaBoost$_{\ell_1}$ controls the margin variance via maximizing 
the entropy of the weights on the training examples 

In AdaBoost training, there are two sets of weights: the 
weights of the weak classifiers w and the weights on the 

3. Hereafter, we use LPBoost to denote hard-margin LPBoost unless 
otherwise specified. 



training examples $\mathbf{u}$. In the last section, we supposed that 
limiting $\mathbf{u}$ is beneficial for classification performance. By 
looking at the Karush-Kuhn-Tucker (KKT) conditions of the 
convex program that we have formulated, we are able to 
reveal the relationship between the two sets of weights. 
More precisely, we show how AdaBoost$_{\ell_1}$ (and AdaBoost$^4$) 
controls the margin variance by optimizing the entropy of 
weights $\mathbf{u}$. 

Recall that we have to introduce new equalities $z_i = -y_i H_i \mathbf{w}$, $\forall i$, 
in order to obtain the dual (10) (and (15)). 
Obviously $z_i$ is the negative margin of sample $\mathbf{x}_i$. Notice 
that the Lagrange multiplier $\mathbf{u}$ is associated with these 
equalities. Let $(\mathbf{w}^*, \mathbf{z}^*)$ and $(\mathbf{u}^*, \mathbf{q}^*, r^*)$ be any primal and 
dual optimal points with zero duality gap. One of the KKT 
conditions tells us 



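$$\nabla_{\mathbf{z}} L(\mathbf{w}^*, \mathbf{z}^*, \mathbf{u}^*, \mathbf{q}^*, r^*) = \mathbf{0}. \qquad (23)$$

The Lagrangian $L(\cdot)$ is defined in (12). This equation leads to 

$$u_i^* = \frac{\exp z_i^*}{\sum_{k=1}^{M} \exp z_k^*}, \quad \forall i = 1, \dots, M. \qquad (24)$$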



Equ. (24) guarantees that $\mathbf{u}^*$ is a probability distribution. 
Note that (24) is actually the same as the update rule 
used in AdaBoost. The optimal value$^5$ of the Lagrange 
dual problem (10), which we denote $\mathrm{Opt}_{(10)}$, equals the 
optimal value of the original problem (5) (and (11)) due to 
the strong duality, hence $\mathrm{Opt}_{(5)} = \mathrm{Opt}_{(10)}$. 
From (24), at optimality we have 

$$\begin{aligned}
-z_i^* &= -\log u_i^* - \log\left(\sum_{k=1}^{M} \exp z_k^*\right) \\
&= -\log u_i^* - \mathrm{Opt}_{(10)} \\
&= -\log u_i^* - \mathrm{Opt}_{(5)}, \quad \forall i = 1, \dots, M. \qquad (25)
\end{aligned}$$

This equation suggests that, after convergence, the margins' 
values are determined by the weights on the training 
examples u* and the cost function's value. From (25), the 
margin's variance is entirely determined by u*: 

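$$\operatorname{var}\{-\mathbf{z}^*\} = \operatorname{var}\{\log \mathbf{u}^*\} + \operatorname{var}\{\mathrm{Opt}_{(5)}\} = \operatorname{var}\{\log \mathbf{u}^*\}. \qquad (26)$$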
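The relation between margins and example weights is an algebraic identity of the softmax map, so it can be illustrated with any vector of negative margins. The sketch below uses hypothetical values for `z` and is not tied to an optimal solution; at optimality the log-sum-exp value `opt` would be the optimal cost.

```python
import math

def lse(z):
    """Numerically stable log-sum-exp."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def variance(xs):
    """Population variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

z = [-1.2, -0.4, -2.5, 0.3, -0.9]         # hypothetical negative margins z_i
opt = lse(z)                              # value of the log-sum-exp cost
u = [math.exp(zi - opt) for zi in z]      # weights given by the softmax of z

# Margin of example i is -z_i; compare with -log(u_i) - opt.
residual = max(abs(-zi - (-math.log(ui) - opt)) for zi, ui in zip(z, u))

var_margin = variance([-zi for zi in z])
var_log_u = variance([math.log(ui) for ui in u])
```

Because `opt` is the same constant for every example, it shifts all margins equally and drops out of the variance.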

We now understand why capping $\mathbf{u}$ as LPBoost 
does, or making $\mathbf{u}$ more uniform as AdaBoost does, can improve 
the classification performance. These two equations reveal 
the important role that the weight distribution $\mathbf{u}$ plays in 
AdaBoost. All that we knew previously is that the weights 
on the training examples measure how difficult it is to classify 
an individual example correctly. In fact, besides that, 
the weight distribution on the training examples is also a 
proxy for minimizing the margin's distribution divergence. 
From the viewpoint of optimization, this is an interesting 
finding. In AdaBoost, one of the main purposes is to control 
the divergence of the margin distribution, which may not be 

4. We believe that the only difference between AdaBoost$_{\ell_1}$ and 
AdaBoost is on the optimization method employed by each algorithm. 
We conjecture that some theoretical results on AdaBoost$_{\ell_1}$ derived in 
this paper may also apply to AdaBoost. 

5. Hereafter we use the symbol $\mathrm{Opt}_{(\cdot)}$ to denote the optimal value 
of Problem $(\cdot)$. 

easy to optimize directly because a margin can take a value 
out of the range [0, 1], where entropy is not applicable. 
AdaBoost's cost function allows one to do so implicitly in 
the primal but explicitly in the dual. A future research topic 
is to apply this idea to other machine learning problems. 

The connection between the dual variable u and mar- 
gins tells us that AdaBoost often seems to optimize the 
minimum margin (or average margin? We will answer this 
question in the next section.) but also it considers another 
quantity related to the variance of the margins. In the dual 
problem (15), minimizing the maximum edge on the weak 
classifiers contributes to maximizing the margin. At the 
same time, minimizing the negative entropy of weights on 
training examples contributes to controlling the margin's 
variance. We make this useful observation by examining 
the dual problem as well as the KKT optimality conditions. 
But the exact statistical measures that AdaBoost optimizes 
remain unclear. The next section presents a complete 
answer to this question through analyzing AdaBoost's pri- 
mal optimization problem. 

We know that Arc-Gv chooses w in a different way 
from AdaBoost. Therefore Arc-Gv optimizes a different cost 
function and does not minimize the negative entropy of 
u any more. We expect that AdaBoost will have a more 
uniform distribution of u. We run AdaBoost and Arc-Gv 
with decision stumps on two datasets breast-cancer and 
australian (all datasets used in this paper are available at 
liill unless otherwise specified). Fig. 1 displays the results. 
AdaBoost indeed has a smaller negative entropy of $\mathbf{u}$ in both 
experiments, which agrees with our prediction. 

It is evident now that AdaBoost$_{\ell_1}$ controls the variance 
of margins by regularizing the Shannon entropy of the cor- 
responding dual variable u. For on-line learning algorithms 
there are two main families of regularization strategies: 
entropy regularization and regularization using squared 
Euclidean distance. A question that naturally arises here 
is: What happens if we use squared Euclidean distance to 
replace the entropy in the dual of AdaBoost$_{\ell_1}$ (15)? In other 
words, can we directly minimize the variance of the dual variable 
u to achieve the purpose of controlling the variance of margins? 
We answer this question by having a look at the convex 
loss functions for classification. 

Fig. 2 plots four popular convex loss functions. It is 
shown in [26] that as the data size increases, practically 
all popular convex loss functions are Bayes-consistent, al- 
though convergence rates and other measures of consis- 
tency may vary. In the context of boosting, AdaBoost, Logit- 
Boost and soft-margin LPBoost use exponential loss, logistic 
loss and hinge loss respectively. Here we are interested in 
the squared hinge loss. LogitBoost will be discussed in the 
next section. As mentioned, in theory, there is no particular 
reason to prefer hinge loss to squared hinge loss. Now if 
squared hinge loss is adopted, the cost function of soft-margin 
LPBoost (21) becomes 

$$\max_{\mathbf{w}, \varrho, \boldsymbol{\xi}} \ \varrho - D \sum_{i=1}^{M} \xi_i^2,$$

and the constraints remain the same as in (21). Its dual is 
easily derived$^6$: 

$$\min_{r, \mathbf{u}} \ r + \frac{1}{4D} \sum_{i=1}^{M} u_i^2 \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{ij} \le r \ (\forall j = 1, \dots, N),\ \mathbf{u} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{u} = 1. \qquad (27)$$

Fig. 1: Negative entropy of $\mathbf{u}$ produced by the standard AdaBoost 
and Arc-Gv at each iteration on datasets breast-cancer and australian 
respectively (horizontal axis: number of iterations). The negative entropy 
produced by AdaBoost (black) is consistently lower than the one by Arc-Gv (blue). 

We can view the above optimization problem as variance 
regularized LPBoost. In short, to minimize the variance of 
the dual variable u for controlling the margin's variance, 
one can simply replace soft-margin LPBoost's hinge loss 
with the squared hinge loss. Both the primal and dual 
problems are quadratic programs (QP) and hence can be 
efficiently solved using off-the-shelf QP solvers like MOSEK 
0, CPlex Q. 
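The quadratic term of the variance regularized dual can be traced back to the slack variables. The following is a brief sketch of that step, written for the equivalent minimization of $-\varrho + D \sum_i \xi_i^2$; the multiplier names $\mathbf{u}$, $\mathbf{q}$ and $r$ are chosen here to match the earlier derivations and are otherwise an assumption.

```latex
% Lagrangian of: min  -\varrho + D \sum_i \xi_i^2
%                s.t. y_i H_i w >= \varrho - \xi_i,  w >= 0,  1^T w = 1.
\begin{align*}
L &= -\varrho + D\sum_{i=1}^{M}\xi_i^2
     - \sum_{i=1}^{M} u_i\,\bigl(y_i H_i \mathbf{w} - \varrho + \xi_i\bigr)
     - \mathbf{q}^\top\mathbf{w} + r\,\bigl(\mathbf{1}^\top\mathbf{w} - 1\bigr),
     \qquad \mathbf{u}\succeq\mathbf{0},\ \mathbf{q}\succeq\mathbf{0}, \\
\inf_{\xi_i}\,\bigl(D\xi_i^2 - u_i\xi_i\bigr)
  &= -\frac{u_i^2}{4D}
     \quad\text{attained at } \xi_i = \frac{u_i}{2D} \ge 0, \\
\inf_{\varrho}\ \varrho\,\bigl(\mathbf{1}^\top\mathbf{u} - 1\bigr) > -\infty
  &\iff \mathbf{1}^\top\mathbf{u} = 1, \\
\inf_{\mathbf{w}}\ \Bigl(r\mathbf{1}^\top - \mathbf{q}^\top
     - \sum_{i=1}^{M} u_i y_i H_i\Bigr)\mathbf{w} > -\infty
  &\iff \sum_{i=1}^{M} u_i y_i H_{ij} \le r \quad (\forall j), \\
\text{dual:}\quad
  \max_{r,\mathbf{u}}\ -r - \frac{1}{4D}\sum_{i=1}^{M}u_i^2
  \quad&\text{or equivalently}\quad
  \min_{r,\mathbf{u}}\ r + \frac{1}{4D}\sum_{i=1}^{M}u_i^2 .
\end{align*}
```

Since the minimizing slack $\xi_i = u_i / (2D)$ is automatically nonnegative, the primal constraint on the slacks never becomes active, which is consistent with dropping it.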

Actually we can generalize the hinge loss into 

$$\left(\max\{0, 1 - yF(\mathbf{x})\}\right)^p.$$

When $p \ge 1$, the loss is convex. $p = 1$ is the hinge 
loss and $p = 2$ is the squared hinge loss. If we use a 
generalized hinge loss ($p > 1$) for boosting, we end up 
with a regularized LPBoost which has the format: 

$$\min_{r, \mathbf{u}} \ r + D^{1-q} \left(p^{1-q} - p^{-q}\right) \sum_{i=1}^{M} u_i^q, \qquad (28)$$

subject to the same constraints as in (27). Here $p$ and $q$ 
are dual to each other by $\frac{1}{p} + \frac{1}{q} = 1$. It is interesting that 
(28) can also be seen as entropy regularized LPBoost; more 
precisely, Tsallis entropy [29] regularized LPBoost. 

6. The primal constraint $\boldsymbol{\xi} \succeq \mathbf{0}$ can be dropped because it is implicitly 
enforced. 

Definition 3.2. (Tsallis entropy) Tsallis entropy is a generalization 
of the Shannon entropy defined as 

$$S_q(\mathbf{u}) = \frac{1 - \sum_{i=1}^{M} u_i^q}{q - 1}, \quad (\mathbf{u} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{u} = 1), \qquad (29)$$

where $q$ is a real number. In the limit as $q \to 1$, we have 
$u_i^{q-1} = \exp((q - 1) \log u_i) \approx 1 + (q - 1) \log u_i$. So $S_1 = -\sum_{i=1}^{M} u_i \log u_i$, 
which is Shannon entropy. 

Tsallis entropy [29] can also be viewed as a $q$-deformation 
of Shannon entropy because $S_q(\mathbf{u}) = -\sum_i u_i \log_q u_i$, where 
$\log_q(u) = \frac{u^{q-1} - 1}{q - 1}$ is the $q$-logarithm. Clearly $\log_q(u) \to \log(u)$ 
when $q \to 1$. 
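The two properties of Tsallis entropy mentioned here, the Shannon limit and the deformed-logarithm form, can be checked numerically. This is a small sketch on a hypothetical distribution; the closed form $(1 - \sum_i u_i^q)/(q - 1)$ is the standard definition, while the particular deformed logarithm `q_log` is an assumption chosen so that the identity $S_q(\mathbf{u}) = -\sum_i u_i \log_q u_i$ holds exactly.

```python
import math

def shannon_entropy(u):
    """-sum_i u_i log u_i with 0 log 0 = 0."""
    return -sum(ui * math.log(ui) for ui in u if ui > 0.0)

def tsallis_entropy(u, q):
    """Standard Tsallis entropy (1 - sum_i u_i^q) / (q - 1), for q != 1."""
    return (1.0 - sum(ui ** q for ui in u)) / (q - 1.0)

def q_log(x, q):
    """Deformed logarithm (x^(q-1) - 1) / (q - 1); tends to log(x) as q -> 1."""
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)

u = [0.1, 0.2, 0.3, 0.4]                        # hypothetical distribution

s2 = tsallis_entropy(u, 2.0)                    # equals 1 - sum_i u_i^2
near_shannon = tsallis_entropy(u, 1.0 + 1e-6)   # q close to 1
via_q_log = -sum(ui * q_log(ui, 2.0) for ui in u)
```

For `q = 2` the entropy is one minus the sum of squared weights, which is why the squared hinge loss leads to a dual regularized by the squared Euclidean norm of the weights.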

In summary we conclude that although the primal prob- 
lems of boosting with different loss functions seem dis- 
similar, their corresponding dual problems share the same 
formulation. Most of them can be interpreted as entropy 
regularized LPBoost. Table 1 summarizes the result. The 
analysis of LogitBoost will be presented in the next section. 

3.4 Lagrange dual of LogitBoost 

Thus far, we have discussed AdaBoost and its relation to 
LPBoost. In this section, we consider LogitBoost [6] from 
its dual. 

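Theorem 3.2. The dual of LogitBoost is a binary relative 
entropy maximization problem, which writes 

$$\max_{r, \mathbf{u}} \ \frac{r}{T} - \sum_{i=1}^{M} \left[ (-u_i) \log(-u_i) + (1 + u_i) \log(1 + u_i) \right] \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{ij} \ge r \ (\forall j = 1, \dots, N). \qquad (30)$$

We can also rewrite it into an equivalent form: 

$$\min_{r, \mathbf{u}} \ r + T \sum_{i=1}^{M} \left[ u_i \log u_i + (1 - u_i) \log(1 - u_i) \right] \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{ij} \le r \ (\forall j = 1, \dots, N). \qquad (31)$$

The proof follows the fact that the conjugate of the logistic 
loss function $\mathrm{logit}(x) = \log(1 + \exp(-x))$ is 

$$\mathrm{logit}^*(u) = \begin{cases} (-u) \log(-u) + (1 + u) \log(1 + u), & 0 \ge u \ge -1; \\ \infty, & \text{otherwise,} \end{cases}$$

with $0 \log 0 = 0$. $\mathrm{logit}^*(u)$ is a convex function in its 
domain. The corresponding primal is 

$$\min_{\mathbf{w}, \mathbf{z}} \ \sum_{i=1}^{M} \mathrm{logit}(z_i) \quad \text{s.t. } z_i = y_i H_i \mathbf{w} \ (\forall i = 1, \dots, M),\ \mathbf{w} \succeq \mathbf{0},\ \mathbf{1}^\top \mathbf{w} = \frac{1}{T}. \qquad (32)$$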
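The conjugate of the logistic loss can be checked against a direct numerical evaluation of the supremum in the definition of the convex conjugate. The sketch below is illustrative only: the search interval and the test values of `u` are assumptions, and ternary search is used because the function being maximized is concave in `x`.

```python
import math

def logit_loss(x):
    """Logistic loss log(1 + exp(-x))."""
    return math.log1p(math.exp(-x))

def xlogx(t):
    """t * log(t) with the convention 0 log 0 = 0."""
    return 0.0 if t == 0.0 else t * math.log(t)

def logit_conjugate(u):
    """Closed-form conjugate; infinite outside the interval [-1, 0]."""
    if u < -1.0 or u > 0.0:
        return math.inf
    return xlogx(-u) + xlogx(1.0 + u)

def conjugate_by_search(u, lo=-40.0, hi=40.0):
    """sup_x (u * x - logit_loss(x)) via ternary search on a concave function."""
    def objective(x):
        return u * x - logit_loss(x)
    for _ in range(200):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if objective(a) < objective(b):
            lo = a
        else:
            hi = b
    return objective(0.5 * (lo + hi))

checks = [(u, logit_conjugate(u), conjugate_by_search(u)) for u in (-0.9, -0.5, -0.1)]
```

For each tested `u` strictly inside the interval, the searched supremum agrees with the closed form; the maximizer is `x = log((1 + u) / (-u))`.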



Fig. 2: Various loss functions used in classification (loss plotted
against margins). Exponential: $\exp(-s)$; logistic: $\log(1 + \exp(-s))$;
hinge: $\max\{0, 1 - s\}$; squared hinge: $(\max\{0, 1 - s\})^2$.
Here $s = yF(x)$.

In (31), the dual variable u has a constraint $1 \succeq u \succeq 0$,
which is automatically enforced by the logarithmic function.
Another difference of (31) from the duals of AdaBoost and
LPBoost etc. is that u does not need to be normalized. In
other words, in LogitBoost the weight associated with each
training sample is not necessarily a distribution. As in the
case of AdaBoost, we can also relate a dual optimal point $u^\star$
and a primal optimal point $w^\star$ (between (31) and (32)) by

$$u_i^\star = \frac{\exp(-z_i^\star)}{1 + \exp(-z_i^\star)}, \quad \forall i = 1, \dots, M. \tag{33}$$

So the margin of $x_i$ is solely determined by $u_i^\star$:
$z_i^\star = \log\frac{1 - u_i^\star}{u_i^\star}$, $\forall i$. For a positive
margin ($x_i$ is correctly classified), we must have $u_i^\star < 0.5$.
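The relation (33) and its inverse are easy to exercise directly. The following sketch uses hypothetical function names; it only illustrates that the map between a margin and its dual weight is one-to-one and that a positive margin corresponds to a weight below 0.5.

```python
import math

def dual_weight_from_margin(z):
    """u = exp(-z) / (1 + exp(-z)), as in (33), evaluated in a numerically stable way."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def margin_from_dual_weight(u):
    """Inverse map z = log((1 - u) / u), valid for 0 < u < 1."""
    return math.log((1.0 - u) / u)

for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    u = dual_weight_from_margin(z)
    # Prints the margin, its dual weight, and the margin recovered from the weight.
    print(z, u, margin_from_dual_weight(u))
```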

Similarly, we can also use CG to solve LogitBoost. As
shown in Algorithm 2 for the case of AdaBoost, the only
modification is to solve a different dual problem (here we
need to solve (31)).

3.5 AdaBoost$_{\ell_1}$ approximately maximizes the average
margin and minimizes the margin variance

Before we present our main result, a lemma is needed.

Lemma 3.1. The margins of AdaBoost$_{\ell_1}$ and AdaBoost follow
a Gaussian distribution. In general, the larger the
number of weak classifiers, the more closely the margin
follows the form of a Gaussian, under the assumption that
the selected weak classifiers are uncorrelated.



Proof: The central limit theorem [30] states that the sum
of a set of i.i.d. random variables $x_i$ $(i = 1, \dots, N)$ is
approximately distributed following a Gaussian distribution if the
random variables have finite mean and variance.

Note that the central limit theorem applies when each
variable $x_i$ has an arbitrary probability distribution $Q_i$ as
long as the mean and variance of $Q_i$ are finite.

As mentioned, the normalized margin of AdaBoost for the
i-th example is defined as

$$\varrho_i = \Big(y_i \sum_{j=1}^{N} h_j(x_i) w_j\Big) \Big/ 1^\top w = -z_i / 1^\top w. \tag{34}$$

In the following analysis, we ignore the normalization
term $1^\top w$ because it does not have any impact on the
margin's distribution. Hence the margin $\varrho_i$ is the sum of
N variables $\hat{w}_j$ with $\hat{w}_j = y_i h_j(x_i) w_j$. It is easy to see that
each $\hat{w}_j$ follows a discrete distribution with binary values
either $w_j$ or $-w_j$. Therefore $\hat{w}_j$ must have finite mean and
variance. Using the central limit theorem, we know that the
distribution of $\varrho_i$ is a Gaussian.

TABLE 1: Dual problems of boosting algorithms are entropy regularized LPBoost.

algorithm                             loss in primal          entropy regularized LPBoost in dual
AdaBoost                              exponential loss        Shannon entropy
LogitBoost                            logistic loss           binary relative entropy
soft-margin $\ell_p$ (p > 1) LPBoost  generalized hinge loss  Tsallis entropy

In the case of discrete variables ($\varrho_i$ can be discrete),
the assumption of identical distributions can be substantially
weakened [31]. The generalized central limit theorem essentially
states that anything that can be thought of as being
made up of the sum of many small independent variables
is approximately normally distributed.

A condition of the central limit theorem is that the N
variables must be independent. When the number
of weak hypotheses is finite, as the margin is expressed
in (34), each $h_j(\cdot)$ is fixed beforehand, and assuming that all the
training examples are randomly and independently drawn, the
variable $\varrho_i$ would be independent too. When the set of
weak hypotheses is infinite, it is well known that AdaBoost
usually selects independent weak classifiers such that
each weak classifier makes different errors on the training
dataset [13]. In this sense, the $\hat{w}_j$ might be viewed as roughly
independent from each other. More diverse weak classifiers
will make the selected weak classifiers less dependent.⁷ □

Here we give some empirical evidence for approximate
Gaussianity. The normal (Gaussian) probability plot is used
to visually assess whether the data follow a Gaussian
distribution. If the data are Gaussian, the plot forms a straight
line. Other distribution types introduce curvature in the
plot. We run AdaBoost with decision stumps on the dataset
australian. Fig. 3 shows two plots of the margins with 50
and 1100 weak classifiers, respectively. We see that with 50
weak classifiers, the margin distribution can be reasonably
approximated by a Gaussian; with 1100 classifiers, the
distribution is very close to a Gaussian. The kurtosis of
1D data provides a numerical evaluation of the Gaussianity.
We know that the kurtosis of a Gaussian distribution is
zero (that is, the excess kurtosis, the fourth standardized
moment minus 3) and almost all other distributions have non-zero
kurtosis. In our experiment, the kurtosis is −0.056 for the
case with 50 weak classifiers and −0.34 for 1100 classifiers.
Both are close to zero, which indicates that AdaBoost's margin
distribution can be well approximated by a Gaussian.
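The kurtosis check described above can be reproduced on synthetic data. The sketch below does not use the australian dataset; instead it simulates margins as sums of independent terms $\pm w_j$, which is the idealized setting of Lemma 3.1. The simulation parameters (number of examples, probability of a correct vote, random weights) are assumptions made purely for illustration.

```python
import random

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3; zero for a Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / (m2 * m2) - 3.0

def simulated_margins(num_weak, num_examples, p_correct, rng):
    """Margins of an idealized ensemble: each weak classifier votes independently."""
    weights = [rng.random() for _ in range(num_weak)]
    total = sum(weights)
    margins = []
    for _ in range(num_examples):
        # Each term is +w_j with probability p_correct and -w_j otherwise.
        m = sum(w if rng.random() < p_correct else -w for w in weights)
        margins.append(m / total)
    return margins

rng = random.Random(0)
for num_weak in (5, 50):
    print(num_weak, excess_kurtosis(simulated_margins(num_weak, 5000, 0.7, rng)))
```

Under the independence assumption, the excess kurtosis of the sum shrinks roughly in proportion to one over the number of weak classifiers, which is the behavior the lemma relies on.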

Theorem 3.3. AdaBoost$_{\ell_1}$ approximately maximizes the
unnormalized average margin and at the same time minimizes
the variance of the margin distribution, under the
assumption that the margin follows a Gaussian distribution.



Fig. 3: Gaussianity test (normal probability plots) for the margin
distribution with 50 and 1100 weak classifiers, respectively. A
Gaussian distribution will form a straight line. The dataset used
is australian.

7. Nevertheless, this statement is not rigid.

Proof: From the formulation given earlier, the cost function that
AdaBoost$_{\ell_1}$ minimizes is

$$f_{ab}(w) = \log\Big(\sum_{i=1}^{M} \exp\Big(-\frac{\varrho_i}{T}\Big)\Big). \tag{35}$$

As proved in Lemma 3.1, $\varrho_i$ follows a Gaussian

$$G(\varrho; \bar{\varrho}, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(\varrho - \bar{\varrho})^2}{2\sigma^2}\Big),$$

with mean $\bar{\varrho}$ and variance $\sigma^2$. We assume
that the optimal value of the regularization parameter T is
known a priori.

The Monte Carlo integration method can be used to
compute a continuous integral

$$\int g(x) f(x)\, dx \approx \frac{1}{K} \sum_{k=1}^{K} f(x_k), \tag{36}$$

where $g(x)$ is a probability distribution such that
$\int g(x)\, dx = 1$ and $f(x)$ is an arbitrary function. $x_k$
$(k = 1, \dots, K)$ are randomly sampled from the distribution
$g(x)$. The more samples are used, the more accurate the
approximation is.
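As a small illustration of (36), the sketch below estimates $\int g(x)\, x^2\, dx$ with $g$ the standard normal density, whose exact value is 1. The choice of $f$ and $g$ and the helper names are assumptions for this example only.

```python
import random

def monte_carlo_integral(f, sample_from_g, num_samples, rng):
    """Approximate the integral of g(x) f(x) dx by averaging f over draws from g."""
    return sum(f(sample_from_g(rng)) for _ in range(num_samples)) / num_samples

def draw_standard_normal(rng):
    """One draw from g = N(0, 1)."""
    return rng.gauss(0.0, 1.0)

rng = random.Random(0)
for K in (100, 10000, 100000):
    # The estimate tends to the exact value 1 as K grows.
    print(K, monte_carlo_integral(lambda x: x * x, draw_standard_normal, K, rng))
```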

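(35) can be viewed as a discrete Monte Carlo approximation
of the following integral (we omit a constant term
$\log M$, which is irrelevant to the analysis):

$$\begin{aligned}
&\log \int_{\varrho_1}^{\varrho_2} G(\varrho; \bar{\varrho}, \sigma) \exp\Big(-\frac{\varrho}{T}\Big)\, d\varrho \\
&\quad = \log \int_{\varrho_1}^{\varrho_2} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(\varrho - \bar{\varrho})^2}{2\sigma^2} - \frac{\varrho}{T}\Big)\, d\varrho \\
&\quad = -\log 2 - \frac{\bar{\varrho}}{T} + \frac{\sigma^2}{2T^2} + \log\Big[\operatorname{erf}\Big(\frac{\varrho - \bar{\varrho}}{\sqrt{2}\,\sigma} + \frac{\sigma}{\sqrt{2}\,T}\Big)\Big|_{\varrho_1}^{\varrho_2}\Big],
\end{aligned} \tag{37}$$

where $\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-s^2)\, ds$ is the Gauss error
function. The integral range is $[\varrho_1, \varrho_2]$. With no explicit
knowledge about the integration range, we may roughly calculate
the integral from $-\infty$ to $+\infty$. Then the last term in (37) is
$\log 2$ and the result is analytical and simple:

$$f_{ab}(w) = -\frac{\bar{\varrho}}{T} + \frac{\sigma^2}{2T^2}. \tag{38}$$

This is a reasonable approximation because Gaussian
distributions drop off quickly (the Gaussian is not considered a
heavy-tailed distribution). Also, this approximation implies
that we are considering the case in which the number of samples
goes to $+\infty$.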

Consequently, AdaBoost approximately maximizes the
cost function

$$-f_{ab}(w) = \frac{\bar{\varrho}}{T} - \frac{\sigma^2}{2T^2}. \tag{39}$$

This cost function has a clear and simple interpretation: the
first term, $\bar{\varrho}/T$, is the unnormalized average margin, and the
second term is proportional to $\sigma^2/T^2$, the unnormalized margin
variance. So AdaBoost maximizes the unnormalized average margin
and also takes minimizing the unnormalized margin variance
into account. This way a better margin distribution can
be obtained. □
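The Gaussian approximation leading to (38) and (39) can be checked by simulation. The following sketch draws synthetic margins from a Gaussian and compares the log of the sample mean of $\exp(-\varrho/T)$, i.e. (35) without the constant $\log M$, with the closed form $-\bar{\varrho}/T + \sigma^2/(2T^2)$. The particular mean, standard deviation and T are arbitrary illustrative values, not taken from the experiments.

```python
import math
import random

def log_mean_exp_cost(margins, T):
    """log of the sample mean of exp(-rho / T), i.e. (35) minus the constant log M."""
    M = len(margins)
    shift = max(-r / T for r in margins)  # log-sum-exp shift for numerical stability
    return shift + math.log(sum(math.exp(-r / T - shift) for r in margins) / M)

def gaussian_closed_form(mean, sigma, T):
    """Right-hand side of (38): -mean / T + sigma^2 / (2 T^2)."""
    return -mean / T + sigma ** 2 / (2.0 * T ** 2)

rng = random.Random(0)
mean, sigma, T = 0.2, 0.1, 0.5
margins = [rng.gauss(mean, sigma) for _ in range(200000)]
print(log_mean_exp_cost(margins, T), gaussian_closed_form(mean, sigma, T))
```

The agreement degrades as T becomes small relative to the spread of the margins, since the sample mean of $\exp(-\varrho/T)$ is then dominated by a few extreme samples.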

Note that Theorem 3.3 depends on Lemma 3.1, which does
not necessarily hold in practice.

Theorem 3.3 is an important result in the sense that it
tries to contribute to the open question of why AdaBoost
works so well. Much previous work tends to believe
that AdaBoost maximizes the minimum margin. We have
theoretically shown that AdaBoost$_{\ell_1}$ optimizes the entire
margin distribution by maximizing the mean and minimizing
the variance of the margin distribution.

We notice that when $T \to 0$, Theorem 3.3 becomes
invalid because the Monte Carlo integration cannot
approximate the cost function of AdaBoost (35) well. In practice,
T cannot approach zero arbitrarily in AdaBoost.

One may suspect that Theorem 3.3 contradicts the
observation of similarity between LPBoost and AdaBoost as
shown in Section 3.2. LPBoost maximizes the minimum
margin and the dual of AdaBoost is merely an entropy
regularized LPBoost. At first glance, the dual variable r
in (15), (20), and (22) should have the same meaning, i.e.,
maximum edge, which in turn corresponds to the minimum
margin in the primal. Why average margin? To answer
this question, let us again take a look at the optimality
conditions. Let us denote the optimal values of (15) by $r^\star$ and
$u^\star$. At convergence, the dual and primal optimal values coincide, so
$-\frac{1}{T}\big(r^\star + T \sum_{i=1}^{M} u_i^\star \log u_i^\star\big) = \log \sum_{i=1}^{M} \exp(-\varrho_i^\star / T)$.
Hence, we have

$$r^\star = -T \sum_{i=1}^{M} u_i^\star \log u_i^\star - T \log\Big(\sum_{i=1}^{M} \exp\Big(-\frac{\varrho_i^\star}{T}\Big)\Big),$$

where $\varrho_i^\star$ is the normalized margin for $x_i$. Clearly this is
very different from the optimality conditions of LPBoost,
which show that $r^\star$ is the minimum margin. Only when
$T \to 0$ does the above relationship reduce to $r^\star = \min_i \{\varrho_i^\star\}$,
the same as in the case of LPBoost.
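The role of T in this limit can be seen numerically: the quantity $-T \log \sum_i \exp(-\varrho_i / T)$ is a smooth surrogate of the minimum margin that becomes exact only as $T \to 0$. The sketch below uses made-up margins and a hypothetical helper name.

```python
import math

def soft_min(margins, T):
    """-T * log(sum_i exp(-rho_i / T)), computed with the log-sum-exp trick."""
    m = min(margins)
    return m - T * math.log(sum(math.exp(-(r - m) / T) for r in margins))

# Hypothetical normalized margins of four training examples.
margins = [0.3, -0.1, 0.25, 0.05]
for T in (1.0, 0.1, 0.01, 0.001):
    # The printed value approaches min(margins) = -0.1 as T decreases.
    print(T, soft_min(margins, T))
```

For larger T every example contributes to the value, which is consistent with the claim that the regularized problem is sensitive to the whole margin distribution and not only to its minimum.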

3.6 AdaBoost-QP: Direct optimization of the margin 
mean and variance using quadratic programming 

The above analysis suggests that we can directly optimize
the cost function (39). In this section we show that (39) is a
convex programming (more precisely, quadratic programming,
QP) problem in the variable w if we know all the base
classifiers, and hence it can be efficiently solved. Next we
formulate the QP problem in detail. We call the proposed
algorithm AdaBoost-QP.⁸

In kernel methods like SVMs, the original space $\mathcal{X}$ is
mapped to a feature space $\mathcal{F}$. The mapping function $\Phi(\cdot)$
is not explicitly computable. It is shown in [13] that in
boosting, one can think of the mapping function $\Phi(\cdot)$ as being
explicitly known:

$$\Phi(x): x \mapsto [h_1(x), \dots, h_N(x)], \tag{40}$$



using the weak classifiers. Therefore, any weak classifier
set $\mathcal{H}$ spans a feature space $\mathcal{F}$. We can design an algorithm
that optimizes (39):

$$\min_{w}\ \tfrac{1}{2} w^\top A w - T b^\top w, \quad \text{s.t. } w \succeq 0,\ 1^\top w = 1, \tag{41}$$

where $b = \frac{1}{M} \sum_{i=1}^{M} y_i H_i^\top = \frac{1}{M} \sum_{i=1}^{M} y_i \Phi(x_i)$, and
$A = \frac{1}{M-1} \sum_{i=1}^{M} (y_i H_i^\top - b)(y_i H_i^\top - b)^\top = \frac{1}{M-1} \sum_{i=1}^{M} (y_i \Phi(x_i) - b)(y_i \Phi(x_i) - b)^\top$.
Clearly A must be positive semidefinite
and this is a standard convex QP problem. The
non-negativeness constraint $w \succeq 0$ introduces sparsity as
in SVMs. Without this constraint, the above QP can be
analytically solved using eigenvalue decomposition: the
largest eigenvector is the solution. Usually all entries of
this solution would be active (non-zero values).
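To make (41) concrete, the following sketch builds b and A from a small matrix of weak classifier outputs and solves the QP over the simplex with projected gradient descent. This is not the solver used in the paper; the projected gradient method, the $1/(M-1)$ normalization of A, the step size rule and the toy data are all assumptions chosen to keep the example dependency-free.

```python
def margin_statistics(H, y):
    """Return b (mean of y_i * H_i) and A (sample covariance of y_i * H_i)."""
    M, N = len(H), len(H[0])
    rows = [[y[i] * H[i][j] for j in range(N)] for i in range(M)]
    b = [sum(r[j] for r in rows) / M for j in range(N)]
    A = [[sum((r[j] - b[j]) * (r[k] - b[k]) for r in rows) / (M - 1)
          for k in range(N)] for j in range(N)]
    return b, A

def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    s = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, value in enumerate(s, start=1):
        cumulative += value
        t = (cumulative - 1.0) / k
        if value - t > 0:
            theta = t  # the last k that satisfies the condition fixes the threshold
    return [max(x - theta, 0.0) for x in v]

def qp_objective(w, A, b, T):
    """0.5 * w^T A w - T * b^T w, the objective of (41)."""
    n = len(w)
    quad = sum(w[j] * A[j][k] * w[k] for j in range(n) for k in range(n))
    return 0.5 * quad - T * sum(b[j] * w[j] for j in range(n))

def adaboost_qp(H, y, T, steps=2000):
    """Projected gradient sketch for (41), starting from uniform weights."""
    b, A = margin_statistics(H, y)
    n = len(b)
    w = [1.0 / n] * n
    trace = sum(A[j][j] for j in range(n))  # upper bound on the largest eigenvalue of A
    step = 1.0 / trace if trace > 0 else 1.0
    for _ in range(steps):
        grad = [sum(A[j][k] * w[k] for k in range(n)) - T * b[j] for j in range(n)]
        w = project_to_simplex([w[j] - step * grad[j] for j in range(n)])
    return w

# Toy example: 6 examples, 3 weak classifiers with outputs in {-1, +1}.
H = [[1, 1, -1], [1, -1, 1], [1, 1, 1], [-1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [1, 1, 1, -1, -1, 1]
print(adaboost_qp(H, y, T=0.5))
```

Because of the simplex constraint, some entries of the returned w are typically exactly zero, which illustrates the sparsity remark above. A dedicated QP solver would be preferable for problems of realistic size.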
In the kernel space,

$$b^\top w = \frac{1}{M} \Big(\sum_{y_i = 1} \Phi(x_i) - \sum_{y_i = -1} \Phi(x_i)\Big)^{\top} w$$

can be viewed as the projected $\ell_1$ norm distance between
two classes because typically this value is positive, assuming
that each class has the same number of examples.
The matrix A approximately plays the role of the total scatter
matrix in kernel linear discriminant analysis (LDA). Note
that AdaBoost does not take the number of examples in
each class into consideration when it models the problem.
In contrast, LDA (kernel LDA) takes the number of training
examples into consideration. This may explain why an LDA
post-processing on AdaBoost gives a better classification
performance on face detection [33], which is a highly
imbalanced classification problem. This observation of
similarity between AdaBoost and kernel LDA may inspire new
algorithms. We are also interested in developing a CG based
algorithm for iteratively generating weak classifiers.

8. In [32], the authors proposed QP$_{reg}$-AdaBoost for soft-margin
AdaBoost learning, which is inspired by SVMs. Their QP$_{reg}$-AdaBoost
is completely different from ours.

9. To show the connection of AdaBoost-QP with kernel methods, we
have written $\Phi(x_i) = H_i^\top$.

3.7 AdaBoost-CG: Totally corrective AdaBoost using 
column generation 

The number of possible weak classifiers may be infinitely
large. In this case it may be infeasible to solve the
optimization exactly. AdaBoost works on the primal problem directly
by switching between estimating weak classifiers and
computing optimal weights in a coordinate descent way.
There is another method for working out this problem, by
using an optimization technique termed column generation
(CG) [10], [34]. CG mainly works on the dual problem. The
basic concept of the CG method is to add one constraint at a 
time to the dual problem until an optimal solution is identi- 
fied. More columns need to be generated and added to the 
problem to achieve optimality. In the primal space, the CG 
method solves the problem on a subset of variables, which 
corresponds to a subset of constraints in the dual. When 
a column is not included in the primal, the corresponding 
constraint does not appear in the dual. That is to say, a 
relaxed version of the dual problem is solved. If a constraint 
absent from the dual problem is violated by the solution to 
the restricted problem, this constraint needs to be included 
in the dual problem to further restrict its feasible region. In 
our case, instead of solving the optimization of AdaBoost
directly, one computes the most violated constraint in (15)
iteratively for the current solution and adds this constraint
to the optimization problem. In theory, any column that
violates dual feasibility can be added. To do so, we need
to solve the following subproblem:

$$h'(\cdot) = \operatorname*{argmax}_{h(\cdot)} \sum_{i=1}^{M} u_i y_i h(x_i). \tag{42}$$

This strategy is exactly the same as the one that stage-wise
AdaBoost and LPBoost use for generating the best
weak classifier, that is, to find the weak classifier that
produces the minimum weighted training error. Putting all the
above analysis together, we summarize our AdaBoost-CG
in Algorithm 2.
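For decision stumps, the subproblem (42) can be solved by exhaustive search. The sketch below is one straightforward way to do this, assuming stumps of the form $h(x) = \pm\,\mathrm{sign}(x_f - \theta)$; the function names and the toy data are hypothetical and the search is not optimized.

```python
def best_stump(X, y, u):
    """Return (edge, feature, threshold, polarity) maximizing sum_i u_i y_i h(x_i)."""
    M, D = len(X), len(X[0])
    best = (-float("inf"), None, None, None)
    for f in range(D):
        values = sorted(set(row[f] for row in X))
        # Candidate thresholds: below the smallest value and midway between neighbours.
        thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for theta in thresholds:
            edge = sum(u[i] * y[i] * (1 if X[i][f] > theta else -1) for i in range(M))
            for polarity in (1, -1):
                # Flipping the polarity of a stump flips the sign of its edge.
                if polarity * edge > best[0]:
                    best = (polarity * edge, f, theta, polarity)
    return best

def stump_predict(x, feature, threshold, polarity):
    """Output in {-1, +1} of the stump described by (feature, threshold, polarity)."""
    return polarity * (1 if x[feature] > threshold else -1)

# Toy one-dimensional data with uniform dual weights.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
u = [0.25, 0.25, 0.25, 0.25]
print(best_stump(X, y, u))
```

When u sums to one, the edge and the weighted training error are related by error = (1 − edge) / 2, so maximizing the edge is the same as minimizing the weighted error.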

The CG optimization (Algorithm 2) is so general that
it can be applied to all the boosting algorithms considered
in this paper by solving the corresponding dual. The
convergence follows that of general CG algorithms, which is easy to
establish. When a new $h'(\cdot)$ that violates dual feasibility
is added, the new optimal value of the dual problem
(maximization) would decrease. Accordingly, the optimal
value of its primal problem decreases too because they have
the same optimal value due to zero duality gap. Moreover,
the primal cost function is convex, therefore eventually
it converges to the global minimum. A comment on the
last step of Algorithm 2 is that we can get the value
of w easily. Primal-dual interior-point (PD-IP) methods
work on the primal and dual problems simultaneously and
therefore both primal and dual variables are available after
convergence. We use MOSEK [35], which implements PD-IP
methods. The primal variable w is obtained for free when
solving the dual problem (15).

The dual subproblem we need to solve has one constraint 
added at each iteration. Hence after many iterations solving 
the dual problem could become intractable in theory. In 
practice, AdaBoost-CG converges quickly on our tested 
datasets. As pointed out in (H], usually only a small num- 
ber of the added constraints are active and those inactive 
ones may be removed. This strategy prevents the dual 
problem from growing too large. 

AdaBoost-CG is totally corrective in the sense that the
coefficients of all weak classifiers are updated at each
iteration. In [36], an additional correction procedure is
inserted into AdaBoost's weak classifier selection cycle for
achieving total correction. The inserted correction procedure
aggressively reduces the upper bound of the training
error. Like AdaBoost, it works in the primal. In contrast, our
algorithm optimizes the regularized loss function directly
and mainly works in the dual space. In [37], a totally
corrective boosting is proposed by optimizing the entropy,
which is inspired by [18]. As discussed, no explicit
primal-dual connection is established. That is why an LPBoost
procedure is needed over the obtained weak classifiers in
order to calculate the primal variable w. In this sense, [37]
is also similar to the work of [32].

The following diagram summarizes the relationships that
we have derived on the boosting algorithms that we have
considered.

AdaBoost$_{\ell_1}$ primal  --(AdaBoost-CG, Lagrange duality)-->  AdaBoost$_{\ell_1}$ dual
        |                                                          |
   Theorem 3.3                                          entropy regularization
        |                                                          |
   AdaBoost-QP                                                LPBoost dual

4 Experiments 

In this section we provide experimental results to verify the
presented theory. We have mainly used decision stumps as
weak classifiers due to their simplicity and well-controlled
complexity. In some cases, we have also used one of the
simplest linear classifiers, LDA, as weak classifiers. To
avoid the singularity problem when solving LDA, we add
a scaled identity matrix $10^{-4} I$ to the within-class matrix.
For the CG optimization framework, we have confined
ourselves to AdaBoost-CG although the technique is general
and applicable for optimizing other boosting algorithms.

4.1 AdaBoost-QP 

We compare AdaBoost-QP against AdaBoost. We have used
14 benchmark datasets [25]. Except mushrooms, svmguide1,
svmguide3 and w1a, all the other datasets have been scaled
to [−1, 1]. We randomly split each dataset into training,
cross-validation and test sets at a ratio of 70 : 15 : 15.

The stopping criterion of AdaBoost is determined by
cross-validation on {600, 800, 1000, 1200, 1500} rounds of
boosting. For AdaBoost-QP, the best value for the parameter
T is chosen from a set of candidate values by
cross-validation. In this experiment, decision stumps are used
as the weak classifier such that the complexity of the base
classifiers is well controlled.

AdaBoost-QP must access all weak classifiers a priori.
Here we run AdaBoost-QP on the 1500 weak classifiers
generated by AdaBoost. Clearly this number of hypotheses
may not be optimal. Theoretically, the larger the size of the
weak classifier pool is, the better results AdaBoost-QP may
produce. Table 2 reports the results. The experiments show
that among these 14 datasets, AdaBoost-QP outperforms
AdaBoost on 9 datasets in terms of generalization error. On
mushrooms, both perform very well. On the other 4 datasets,
AdaBoost is better.

Algorithm 2 AdaBoost-CG.

Input: Training set $(x_i, y_i)$, $i = 1, \dots, M$; termination
threshold $\varepsilon > 0$; regularization parameter T;
(optional) maximum iteration $N_{\max}$.

Initialization:
  1) $N = 0$ (no weak classifiers selected);
  2) $w = 0$ (all primal coefficients are zeros);
  3) $u_i = \frac{1}{M}$, $i = 1, \dots, M$ (uniform dual weights).

while true do
  1) Find a new base $h'(\cdot)$ by solving Problem (42);
  2) Check for optimal solution:
     if $\sum_{i=1}^{M} u_i y_i h'(x_i) < r + \varepsilon$, then break (problem solved);
  3) Add $h'(\cdot)$ to the restricted master problem, which
     corresponds to a new constraint in the dual;
  4) Solve the dual to obtain updated r and $u_i$
     $(i = 1, \dots, M)$; for AdaBoost, the dual is (15);
  5) $N = N + 1$ (weak classifier count);
  6) (optional) if $N \ge N_{\max}$, then break (maximum
     iteration reached).

Output:
  1) Calculate the primal variable w from the optimality
     conditions and the last solved dual problem;
  2) The learned classifier $F(x) = \sum_{j=1}^{N} w_j h_j(x)$.

Fig. 4: Cumulative margins for AdaBoost, AdaBoost-QP and Arc-Gv
for the breast cancer dataset using decision stumps. Overall, the margin
distribution of AdaBoost-QP is the best and it has the smallest test
error. AdaBoost and Arc-Gv run 600 rounds of boosting. Test error for
AdaBoost, AdaBoost-QP and Arc-Gv is 0.029, 0.027, 0.058 respectively.

We have also computed the normalized version of the
cost function value of (39). In most cases AdaBoost-QP
has a larger value. This is not surprising since AdaBoost-QP
directly maximizes (39) while AdaBoost approximately
maximizes it. Furthermore, the normalized loss function
value is close to the normalized average margin because
the margin variances for most datasets are very small
compared with their means.

We also compute the largest minimum margin and average
margin on each dataset. On all the datasets AdaBoost
has a larger minimum margin than AdaBoost-QP. This
confirms that the minimum margin is not crucial for the
generalization error. On the other hand, the average margin
produced by AdaBoost-QP, which is the first term of the
cost function (39), is consistently larger than the one
obtained by AdaBoost. Indirectly, we have shown that a better
overall margin distribution is more important than the
largest minimum margin. In Fig. 4 we plot cumulative margins
for AdaBoost-QP and AdaBoost on the breast-cancer
dataset with decision stumps. We can see that while Arc-Gv
has the largest minimum margin, it has the worst margin
distribution overall. If we examine the average margins,
AdaBoost-QP has the largest, AdaBoost the second largest and
Arc-Gv the smallest. Clearly a better overall distribution does
lead to a smaller generalization error. When Arc-Gv and AdaBoost
run for more rounds, their margin distributions seem to
converge. That is what we see in Fig. 4. These results agree
well with our theoretical analysis (Theorem 3.3). Another
observation from this experiment is that, to achieve the
same performance, AdaBoost-QP tends to use fewer weak
classifiers than AdaBoost does.
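The margin quantities compared in this section (minimum margin, average margin, margin variance and the cumulative margin curve) can be computed from the weak classifier outputs and the learned weights. The sketch below uses hypothetical helper names and a three-example toy ensemble; it is meant only to pin down the definitions, following the normalized margin in (34).

```python
def normalized_margins(H, y, w):
    """Normalized margin y_i * sum_j h_j(x_i) w_j / sum_j w_j for every example."""
    total = sum(w)
    return [y[i] * sum(h * wj for h, wj in zip(H[i], w)) / total for i in range(len(y))]

def margin_summary(margins):
    """Minimum, mean and sample variance of a list of margins."""
    n = len(margins)
    mean = sum(margins) / n
    variance = sum((m - mean) ** 2 for m in margins) / (n - 1)
    return {"min": min(margins), "mean": mean, "variance": variance}

def cumulative_margin_curve(margins):
    """Pairs (margin value, fraction of examples with margin <= that value)."""
    ordered = sorted(margins)
    n = len(ordered)
    return [(m, (k + 1) / n) for k, m in enumerate(ordered)]

# Toy ensemble: 3 examples, 2 weak classifiers with outputs in {-1, +1}.
H = [[1, -1], [1, 1], [-1, 1]]
y = [1, 1, -1]
w = [2.0, 1.0]
margins = normalized_margins(H, y, w)
print(margin_summary(margins))
print(cumulative_margin_curve(margins))
```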

We have also tested AdaBoost-QP on full sets of weak
classifiers because the number of possible decision stumps
is finite (less than (number of features − 1) × (number of
examples)). Table 3 reports the test error of AdaBoost-QP
on some small datasets. As expected, in most cases, the test
error is slightly better than the results using 1500 decision
stumps in Table 2, and no significant difference is observed.
This verifies the capability of AdaBoost-QP for selecting
and combining relevant weak classifiers.

4.2 AdaBoost-CG 

We run AdaBoost and AdaBoost-CG with decision stumps
on the datasets of [25]. 70% of examples are used for
training; 15% are used for test and the other 15% are
not used because we do not do cross-validation here. The
convergence threshold for AdaBoost-CG ($\varepsilon$ in Algorithm 2)
is set to $10^{-5}$. Another important parameter to tune is
the regularization parameter T. For the first experiment,
we have set it to $1/(1^\top w)$, where w is obtained by running
AdaBoost on the same data for 1000 iterations. Also for fair
comparison, we have deliberately forced AdaBoost-CG to
run 1000 iterations even if the stopping criterion is met.
Both test and training results for AdaBoost and AdaBoost-CG
are reported in Table 4 for a maximum number of
iterations of 100, 500 and 1000.

TABLE 2: Test results of AdaBoost (AB) and AdaBoost-QP (QP). All tests are run 10 times. The mean and standard deviation are reported.
AdaBoost-QP outperforms AdaBoost on 9 datasets.

dataset     algorithm  test error           minimum margin        average margin
australian  AB         [illegible]          −0.012 ± 0.005        0.082 ± [illegible]
            QP         0.13 ± 0.038         −0.227 ± 0.081        0.18 ± 0.052
b-cancer    AB         [illegible]          [illegible] ± 0.009   0.209 ± 0.02
            QP         0.03 ± 0.012         −0.424 ± 0.250        0.523 ± 0.237
diabetes    AB         0.270 ± 0.043        −0.038 ± 0.007        0.055 ± 0.005
            QP         0.262 ± 0.047        −0.107 ± 0.060        0.075 ± 0.031
fourclass   AB         0.088 ± 0.032        −0.045 ± 0.012        0.084 ± 0.009
            QP         0.095 ± 0.028        −0.211 ± 0.059        0.128 ± 0.027
g-numer     AB         0.283 ± 0.033        −0.079 ± 0.017        0.042 ± 0.006
            QP         0.249 ± 0.033        −0.151 ± 0.058        0.061 ± 0.020
heart       AB         0.210 ± 0.032        0.02 ± 0.008          0.104 ± 0.013
            QP         0.190 ± 0.058        −0.117 ± 0.066        0.146 ± 0.059
ionosphere  AB         0.121 ± 0.044        0.101 ± 0.010         [illegible] ± 0.012
            QP         0.139 ± 0.055        −0.035 ± 0.112        0.184 ± 0.063
liver       AB         [illegible] ± 0.040  −0.012 ± 0.007        [illegible]
            QP         0.314 ± 0.060        −0.107 ± 0.044        0.079 ± 0.021
mushrooms   AB         0 ± 0                0.102 ± 0.001         [illegible]
            QP         0.005 ± 0.002        −0.134 ± 0.086        0.221 ± 0.084
sonar       AB         0.145 ± 0.046        0.156 ± 0.008         0.202 ± 0.013
            QP         0.171 ± 0.048        0.056 ± 0.066         0.220 ± 0.045
splice      AB         0.129 ± 0.025        −0.009 ± 0.008        0.117 ± 0.009
            QP         0.106 ± 0.029        −0.21 ± 0.037         0.189 ± 0.02
svmguide1   AB         0.035 ± 0.009        0.010 ± 0.008         0.157 ± 0.016
            QP         0.040 ± 0.009        −0.439 ± 0.183        0.445 ± 0.155
svmguide3   AB         0.172 ± 0.023        0.011 ± 0.009         0.052 ± 0.005
            QP         0.167 ± 0.022        −0.113 ± 0.084        0.085 ± 0.038
w1a         AB         0.041 ± 0.014        −0.048 ± 0.010        0.084 ± 0.005
            QP         0.029 ± 0.009        −0.624 ± 0.38         0.577 ± 0.363

TABLE 3: Test results of AdaBoost-QP on full sets of decision stumps. All tests are run 10 times.

dataset     test error
australian  0.131 ± 0.041
b-cancer    0.03 ± 0.011
fourclass   0.091 ± 0.02
g-numer     0.243 ± 0.026
heart       0.188 ± 0.058
liver       0.319 ± 0.05
mushroom    0.003 ± 0.001
splice      0.097 ± 0.02

As expected, in terms of test error, no algorithm
statistically outperforms the other one, since they optimize the
same cost function. As we can see, AdaBoost does slightly
better on 6 datasets. AdaBoost-CG outperforms AdaBoost
on 7 datasets and on svmguide1, both algorithms perform
almost identically. Therefore, empirically we conclude that
in terms of generalization capability, AdaBoost-CG is the
same as the standard AdaBoost.

However, in terms of training error and convergence
speed of the training procedure, there is a significant
difference between these two algorithms. Looking at the right
part of Table 4, we see that the training error of AdaBoost-CG
is consistently better or no worse than AdaBoost on all
tested datasets. We have the following conclusions.

• The convergence speed of AdaBoost-CG is faster than
  AdaBoost and in many cases, better training error can
  be achieved. This is because AdaBoost's coordinate
  descent nature is slow while AdaBoost-CG is totally
  corrective.¹⁰ This also means that with AdaBoost-CG,
  we can use fewer weak classifiers to build a good
  strong classifier. This is desirable for real-time
  applications like face detection [38], in which the testing
  speed is critical.
• Our experiments confirm that a smaller training error
  does not necessarily lead to a smaller test error. This
  has been studied extensively in statistical learning
  theory. It is observed that AdaBoost sometimes suffers
  from overfitting and minimizing the exponential cost
  function of the margins does not solely determine test
  error.

In the second experiment, we run cross-validation to
select the best value for the regularization parameter T, same
as in Section 4.1. Table 5 reports the test errors on a subset of
the datasets. Slightly better results are obtained compared
with the results in Table 4, which uses T determined by
AdaBoost.

We also use LDA as weak classifiers to compare the
classification performance of AdaBoost and AdaBoost-CG.
The parameter T of AdaBoost-CG is determined by
cross-validation from {1/2, 1/5, 1/8, 1/10, 1/12, 1/15, 1/20, 1/30,
1/40, 1/50, 1/70, 1/90, 1/100, 1/120, 1/150}. For AdaBoost the
smallest test error from 100, 500 and 1000 runs is reported.
We show the results in Table 6. As we can see, the test error
is slightly better than with decision stumps for both AdaBoost
and AdaBoost-CG. Again, the performances of AdaBoost and
AdaBoost-CG are very similar.

10. Like LPBoost, at each iteration AdaBoost-CG updates the
previous weak classifier weights w.

TABLE 4: Test and training errors of AdaBoost (AB) and AdaBoost-CG (CG). All tests are run 5 times. The mean and standard deviation are
reported. Weak classifiers are decision stumps.

dataset     algorithm  test error 100   test error 500   test error 1000  train error 100  train error 500  train error 1000
australian  AB         0.146 ± 0.028    0.165 ± 0.018    0.163 ± 0.021    0.091 ± 0.013    0.039 ± 0.011    0.013 ± 0.009
            CG         0.177 ± 0.025    0.167 ± 0.023    0.167 ± 0.023    0.013 ± 0.008    0.011 ± 0.007    0.011 ± 0.007
b-cancer    AB         0.041 ± 0.026    0.045 ± 0.030    0.047 ± 0.032    0.008 ± 0.006    0 ± 0            0 ± 0
            CG         0.049 ± 0.033    0.049 ± 0.033    0.049 ± 0.033    0 ± 0            0 ± 0            0 ± 0
diabetes    AB         0.254 ± 0.024    0.263 ± 0.028    0.257 ± 0.041    0.171 ± 0.012    0.120 ± 0.007    0.082 ± 0.006
            CG         0.270 ± 0.047    0.254 ± 0.026    0.254 ± 0.026    0.083 ± 0.008    0.070 ± 0.007    0.070 ± 0.007
fourclass   AB         0.106 ± 0.047    0.097 ± 0.034    0.091 ± 0.031    0.072 ± 0.023    0.053 ± 0.017    0.046 ± 0.017
            CG         0.082 ± 0.031    0.082 ± 0.031    0.082 ± 0.031    0.042 ± 0.015    0.042 ± 0.015    0.042 ± 0.015
g-numer     AB         0.279 ± 0.043    0.288 ± 0.048    0.297 ± 0.051    0.206 ± 0.047    0.167 ± 0.072    0.155 ± 0.082
            CG         0.269 ± 0.040    0.262 ± 0.045    0.262 ± 0.045    0.142 ± 0.077    0.142 ± 0.077    0.142 ± 0.077
heart       AB         0.175 ± 0.073    0.175 ± 0.088    0.165 ± 0.076    0.049 ± 0.022    0 ± 0            0 ± 0
            CG         0.165 ± 0.072    0.165 ± 0.072    0.165 ± 0.072    0 ± 0            0 ± 0            0 ± 0
ionosphere  AB         0.092 ± 0.016    0.104 ± 0.017    0.100 ± 0.016    0 ± 0            0 ± 0            0 ± 0
            CG         0.131 ± 0.034    0.131 ± 0.034    0.131 ± 0.034    0 ± 0            0 ± 0            0 ± 0
liver       AB         0.288 ± 0.101    0.265 ± 0.081    0.281 ± 0.062    0.144 ± 0.018    0.063 ± 0.015    0.020 ± 0.015
            CG         0.288 ± 0.084    0.288 ± 0.084    0.288 ± 0.084    0.017 ± 0.012    0.017 ± 0.011    0.017 ± 0.011
mushrooms   AB         0 ± 0.001        0 ± 0.001        0 ± 0            0 ± 0            0 ± 0            0 ± 0
            CG         0 ± 0            0 ± 0            0 ± 0            0 ± 0            0 ± 0            0 ± 0
sonar       AB         0.206 ± 0.087    0.213 ± 0.071    0.206 ± 0.059    0 ± 0            0 ± 0            0 ± 0
            CG         0.232 ± 0.053    0.245 ± 0.078    0.245 ± 0.078    0 ± 0            0 ± 0            0 ± 0
splice      AB         0.129 ± 0.011    0.143 ± 0.026    0.143 ± 0.020    0.053 ± 0.003    0.008 ± 0.006    0.001 ± 0.001
            CG         0.161 ± 0.033    0.151 ± 0.023    0.151 ± 0.023    0.002 ± 0.002    0.001 ± 0.002    0.001 ± 0.002
svmguide1   AB         0.036 ± 0.012    0.034 ± 0.008    0.037 ± 0.007    0.022 ± 0.002    0.009 ± 0.002    0.002 ± 0.001
            CG         0.037 ± 0.007    0.037 ± 0.007    0.037 ± 0.007    0.001 ± 0.001    0 ± 0.001        0 ± 0.001
svmguide3   AB         0.184 ± 0.037    0.183 ± 0.044    0.182 ± 0.031    0.112 ± 0.009    0.037 ± 0.004    0.009 ± 0.003
            CG         0.184 ± 0.026    0.171 ± 0.023    0.171 ± 0.023    0.033 ± 0.012    0.023 ± 0.016    0.023 ± 0.016
w1a         AB         0.051 ± 0.009    0.038 ± 0.005    0.036 ± 0.004    0.045 ± 0.008    0.028 ± 0.005    0.025 ± 0.005
            CG         0.018 ± 0.001    0.018 ± 0.001    0.018 ± 0.001    0.010 ± 0.004    0.010 ± 0.004    0.010 ± 0.004

Fig. 5: Test error and training error of AdaBoost, AdaBoost-CG for australian, breast-cancer, diabetes, heart, splice and svmguide3 datasets. These
convergence curves correspond to the results in Table 4. The x-axis is on a logarithmic scale for easier comparison.

TABLE 5: Test error of AdaBoost-CG with decision stumps, using cross-validation to select the optimal T. All tests are run 5 times.

dataset     australian     b-cancer       diabetes       fourclass      heart         ionosphere     sonar        splice
test error  0.146 ± 0.027  0.033 ± 0.033  0.266 ± 0.036  0.086 ± 0.027  0.17 ± 0.082  0.115 ± 0.024  0.2 ± 0.035  0.135 ± 0.015

To show that there is no statistically significant difference
between AdaBoost-CG and AdaBoost, we conduct McNemar's test
[39] at a significance level of 0.05. McNemar's test is based
on a $\chi^2$ test [?]. If the $\chi^2$ statistic is not
greater than $\chi^2_{1,0.95} = 3.841459$, the two tested
classifiers can be regarded as having no statistically
significant difference in classification capability. On the 8
datasets with decision stumps and LDA (Tables [5] and [6]),
the $\chi^2$ statistic is smaller than $\chi^2_{1,0.95}$ in
all cases (5 runs per dataset). Consequently, we conclude that
AdaBoost-CG indeed performs very similarly to AdaBoost
for classification.
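
The comparison described above can be sketched in a few lines of Python. This is only an illustration: the text does not say which variant of the McNemar statistic was used, so the continuity-corrected form is an assumption here, and the function names are hypothetical.

```python
# Critical value of the chi-square distribution with 1 degree of freedom
# at the 0.95 level, as quoted in the text.
CHI2_1_095 = 3.841459


def mcnemar_statistic(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar statistic for two classifiers.

    Assumption: the continuity-corrected variant is used; the paper
    does not state which form of the statistic was computed.
    """
    # b: examples classifier A gets wrong and classifier B gets right.
    b = sum(1 for y, a, p in zip(y_true, pred_a, pred_b) if a != y and p == y)
    # c: examples classifier A gets right and classifier B gets wrong.
    c = sum(1 for y, a, p in zip(y_true, pred_a, pred_b) if a == y and p != y)
    if b + c == 0:
        return 0.0  # the two classifiers never disagree
    return (abs(b - c) - 1) ** 2 / (b + c)


def no_significant_difference(y_true, pred_a, pred_b):
    """True if the statistic does not exceed the 0.95 critical value."""
    return mcnemar_statistic(y_true, pred_a, pred_b) <= CHI2_1_095
```

Only the examples on which the two classifiers disagree enter the statistic, which is why two methods with similar error rates and few disagreements fall well below the threshold.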

To examine the effect of the parameter T, we run further
experiments with various values of T on the banana dataset
(2D artificial data) that was used in [?]. We again use
decision stumps. The maximum number of iterations is set to
400; all runs stop before 100 iterations. Table [7] reports
the results. The training error indeed depends on T, and T
also influences the convergence speed. However, over a wide
range of T, the test error does not change significantly. We
do not have a sophisticated technique for tuning T. As
mentioned, the sum of w from a run of AdaBoost can serve as
a heuristic.
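
As a rough illustration of this heuristic, the sketch below runs plain stage-wise AdaBoost with decision stumps and returns the sum of the weak-classifier weights w. It is a minimal pure-Python sketch, not the authors' implementation; the function names are hypothetical, and how the resulting sum maps onto T depends on how the constraint on w is defined earlier in the paper.

```python
import math


def best_stump(X, y, u):
    """Return (weighted_error, feature, threshold, polarity) of the
    decision stump with the lowest weighted error under sample weights u."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                # The stump predicts pol when x[j] > thr and -pol otherwise.
                err = sum(ui for x, yi, ui in zip(X, y, u)
                          if (pol if x[j] > thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def adaboost_weight_sum(X, y, n_rounds):
    """Run stage-wise AdaBoost and return the sum of classifier weights."""
    m = len(y)
    u = [1.0 / m] * m  # sample weights, initially uniform
    w = []             # weak-classifier weights
    for _ in range(n_rounds):
        err, j, thr, pol = best_stump(X, y, u)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        w.append(alpha)
        # Misclassified samples gain weight; then renormalize.
        u = [ui * math.exp(-alpha * yi * (pol if x[j] > thr else -pol))
             for x, yi, ui in zip(X, y, u)]
        total = sum(u)
        u = [ui / total for ui in u]
    return sum(w)
```

For example, `adaboost_weight_sum(X, y, 100)` on the training set gives a single number that can seed a small grid of candidate values for cross-validation, instead of searching blindly.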

Now let us take a closer look at the convergence behavior
of AdaBoost-CG. Fig. [5] shows the test and training errors
of AdaBoost and AdaBoost-CG on 6 datasets. We see that
AdaBoost-CG converges much faster than AdaBoost in
terms of the number of iterations. On most tested datasets,
AdaBoost-CG is around 10 times faster than AdaBoost.
The test errors of the two methods are very close upon
convergence. On some datasets, such as australian and
breast-cancer, we observe over-fitting for AdaBoost.

4.3 LogitBoost-CG 

We have also run LogitBoost-CG on the same datasets. All
settings are the same as for AdaBoost-CG, and the weak
classifiers are decision stumps. Table [8] reports the
experimental results, which are very similar to those in
Table [5]. Neither algorithm achieves better results than
the other on all datasets.

5 Discussion and Conclusion 

In this paper, we have shown that the Lagrange dual
problems of AdaBoost, LogitBoost and soft-margin LPBoost
with generalized hinge loss are all entropy regularized
LPBoost. We have demonstrated, both theoretically and
empirically, that the success of AdaBoost relies on
maintaining a better margin distribution. Based on the dual
formulation, we have proposed a general column generation
based optimization framework, which can be applied to solve
all the boosting algorithms with the various loss functions
mentioned in this paper. Experiments with the exponential
loss show that the classification performance of AdaBoost-CG
is statistically almost identical to that of standard
stage-wise AdaBoost on real datasets. In fact, since both
algorithms optimize the same cost function, we would be
surprised to see a significant difference in their
generalization errors. The main advantage of the proposed
algorithms is their significantly faster convergence.

Compared with conventional AdaBoost, a drawback of
AdaBoost-CG is the introduction of a parameter, as in
LPBoost. While one can argue that AdaBoost implicitly
determines this same parameter by selecting how many
iterations to run, its stopping criterion is nested and thus
efficient to learn. In the case of AdaBoost-CG, it is not
clear how to learn this parameter efficiently. Currently, one
has to run the training procedure multiple times for
cross-validation.

With the optimization framework established here, some
issues in boosting that were previously unclear may now
become clear. For example, to design cost-sensitive
boosting or boosting on uneven datasets, one can simply
modify the primal cost function into a weighted cost
function [40]. The training procedure then follows that of
AdaBoost-CG.
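
To make the idea of a weighted cost function concrete, the sketch below evaluates a class-weighted exponential loss. The exact weighted form used in the cited work is not given in this section, so this particular form, the function name and the per-class cost parameters are all assumptions for illustration only.

```python
import math


def weighted_exp_loss(margins, labels, cost_pos, cost_neg):
    """Class-weighted exponential loss: sum_i c_i * exp(-z_i).

    margins: z_i = y_i * F(x_i) for each training example.
    labels:  y_i in {+1, -1}.
    cost_pos, cost_neg: hypothetical per-class misclassification costs.
    """
    total = 0.0
    for z, y in zip(margins, labels):
        c = cost_pos if y == 1 else cost_neg
        total += c * math.exp(-z)
    return total
```

Setting `cost_pos == cost_neg` recovers the unweighted exponential loss up to a constant factor, so a larger cost on one class simply makes margin violations on that class more expensive.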

To summarize, the convex duality of boosting algorithms
presented in this work generalizes the convex duality in
LPBoost. We have shown some interesting properties of the
derived dual formulation. The duality also leads to new,
efficient learning algorithms, and it provides useful
insights into boosting that may be lacking in existing
interpretations [?], [?].

In the future, we want to extend our work to boosting
with non-convex loss functions such as BrownBoost [41].
It should also be straightforward to optimize boosting
for regression using column generation. We are currently
exploring the application of AdaBoost-CG to efficient object
detection because of its faster convergence, which is more
promising for feature selection [?].

Appendix A 

Description of datasets 

Table [9] provides a description of the datasets we have used
in the experiments.
TABLE 8: Test error of LogitBoost-CG with decision stumps, using cross-validation to select the optimal T. All tests are run 5 times.

dataset     australian    b-cancer       diabetes       fourclass      heart         ionosphere   sonar         splice
test error  0.13 ± 0.043  0.039 ± 0.012  0.238 ± 0.057  0.071 ± 0.034  0.14 ± 0.095  0.2 ± 0.069  0.169 ± 0.05  0.104 ± 0.021



TABLE 9: Description of the datasets. Except for mushrooms, svmguide1, svmguide3 and w1a, all datasets have been scaled to [-1, 1].

dataset        # examples  # features    dataset          # examples  # features
australian     690         14            liver-disorders  345         6
breast-cancer  683         10            mushrooms        8124        112
diabetes       768         8             sonar            208         60
fourclass      862         2             splice           1000        60
german-numer   1000        24            svmguide1        3089        4
heart          270         13            svmguide3        1243        22
ionosphere     351         34            w1a              2477        300